The Agents Are Coming

2023 was undoubtedly the year of AI, and the disruption shows no signs of slowing down. But 2024 appears to be turning into the year of Agents. In the last few months we’ve seen Rabbit R1, Devin AI, and Arc Search. These products seem to share a few principles:

  • Computers should be smarter (duh).
  • Using computers should be as simple as talking to a human.
  • The computer should just do it for me.

These innovations lead some of us to wonder:

Are apps doomed to die?

Language is the Ultimate UX

Try to remember “the future”. How did it feel to watch James Bond use a computer? When Batman was tracking down The Joker and needed help from his supercomputer, did he open the Maps app? When Rick Deckard from Blade Runner needed to zoom in on a photo to catch his suspect, did he open Photoshop? Of course not.

They just talked to their computer and the computer did what they told it to. We tried to do this 10 years ago with Siri and Alexa, and the results have been a real mixed bag. They’re just kinda… dumb. Why? Because they are clever magic tricks. The voice assistants of the last decade have little to no actual understanding. They are preprogrammed to expect certain types of requests and respond with certain types of responses. And it’s up to their human designers to anticipate and implement every type of interaction a user might have.

But all of that changed with ChatGPT and LLMs. For the first time¹, humans felt like computers genuinely understood them. I could simply talk naturally, and ChatGPT understood what I meant and responded with a genuinely thoughtful answer. It even remembered the context of our greater conversation! I didn’t need to use carefully chosen keywords like a Google search. I didn’t need to enunciate and avoid names and places with obscure spellings like I do with Alexa or Siri. And now with agents like Rabbit R1, I don’t even need to decide which app to use. I just tell the computer what I want, and it does it. Wow! Magic! 🪄

The best GUI is no GUI. Language is the ultimate UI; after all, it’s the UI we’ve been using for thousands of years.

But Is Language As Great As It’s Cracked Up To Be?

Except things aren’t quite so simple, are they? When we actually use these AI products, inevitably we find rough edges. LLMs just straight up make stuff up² sometimes. They’re unreliable.

Let me be clear. These are very real challenges and limitations of current AI technologies, and they mean that AI is not a viable replacement for apps… today. But clever solutions are already being developed to tackle these problems, and it’s only a matter of time until many, if not all, of AI’s current challenges are solved. Pretty much anything that an app can do, an AI can do as well. So, clearly apps are doomed.

Well, maybe not quite. Consider this. Try generating an image using a text-to-image service like Stable Diffusion or ChatGPT. Imagine something you’d like to create and tell it what you want. Voila! It creates an amazing, detailed image that is exactly what you want… except, well, I just want to make one small edit. The person in the picture has eyes that are too big.

Okay. ChatGPT, make the eyes smaller, please. Wait… Oh no, that’s too small. Wait… No, now that’s too big.

Do you see the problem? How do you tell the computer exactly how big something is? Maybe we can give it precise measurements.

Make the eyes 314 pixels taller.

Except, how did you, the human, know how many pixels to use? Can your eyes tell the difference between 314 and 315 pixels? Nope.

I know! We could give the LLM a command to show a ruler on screen. And we could add more commands to increment the pixel size. Great! Problem solved! You know what we just created? A really lame GUI!
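
Sketched out, that “solution” might look something like this. It’s purely illustrative; every name below is made up, and nothing here comes from a real API:

    // The "ruler plus increment commands" idea, written out as hypothetical LLM tools.
    // All names are invented for illustration.

    type EditorState = { eyeHeightPx: number; rulerVisible: boolean };

    const state: EditorState = { eyeHeightPx: 300, rulerVisible: false };

    const tools = {
      // "Show a ruler on screen"
      show_ruler: () => {
        state.rulerVisible = true;
      },
      // "Increment the pixel size"
      adjust_eye_height: (deltaPixels: number) => {
        state.eyeHeightPx += deltaPixels;
      },
    };

    // The user still ends up nudging a number up and down by hand:
    tools.show_ruler();
    tools.adjust_eye_height(14); // "a bit bigger"
    tools.adjust_eye_height(-5); // "no, smaller"
    // ...which is exactly what a slider in a GUI does, minus the slider.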

The Real Problem

I wasn’t kidding when I said that language is the ultimate UI. I really believe it is the inevitable interface of the future. Using a computer really should be as simple as talking to a human being, but there’s one small problem. Humans aren’t always so easy to talk to. Don’t believe me? Just try driving in an unfamiliar city while a friend gives you directions.

No, your left! No, your other left.

Why is it so hard? Because there are some ideas that are difficult to express in language. Many of these ideas are simply easier to express in a GUI. This is particularly true for visual ideas.

How many words do you know for colors? A few dozen, maybe? Now, how many colors can be expressed in a GUI on an iPhone? About 16.7 million!³

Pick the Right Tool For the Job

David Luan made an interesting observation about user interfaces on the Latent Space podcast. At first, computers had text-based interfaces. Then GUIs were created, but they weren’t the primary interface: you started in a command line, and stepped into a GUI when you wanted one. Eventually, GUIs became the primary interface and command lines became obsolete.

Except command lines aren’t obsolete. They are still on basically every operating system today. Why? Because there are a handful of use cases that are better suited for command lines.

Likewise, there are some tasks, particularly visual tasks, that simply aren’t suited to LLMs. Some of these problems (like hallucinations) are engineering problems. These will be hammered out eventually. But some of these problems (like visuals) are interface problems. You could “fix” them, but that’s like trying to fit a square peg into a round hole.

Instead, pick the right tool for the job. Use a GUI where it fits, and an LLM where it fits. And more often than not, I think the optimal choice will be a combination.
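
What might that combination look like? Here’s one rough sketch, again purely illustrative (every name is made up, and fakeLLM() stands in for whatever model API you’d actually call): the LLM interprets the fuzzy request and proposes an edit, then the app hands the user a GUI control to fine-tune the exact value.

    // A sketch of a hybrid interface: language for intent, a GUI control for precision.
    // Everything here is hypothetical.

    type ProposedEdit = { property: "eyeHeight"; suggestedPx: number };

    // Stand-in for an actual LLM call.
    async function fakeLLM(_request: string): Promise<ProposedEdit> {
      return { property: "eyeHeight", suggestedPx: 280 };
    }

    // 1. The LLM turns a fuzzy request into a structured starting point.
    async function interpret(request: string): Promise<ProposedEdit> {
      return fakeLLM(request); // "a little smaller" -> { property: "eyeHeight", suggestedPx: 280 }
    }

    // 2. The app seeds a slider with that suggestion and lets the human
    //    drag it until the picture simply *looks* right.
    function renderSlider(edit: ProposedEdit, onChange: (px: number) => void): void {
      console.log(`slider for ${edit.property}, starting at ${edit.suggestedPx}px`);
      onChange(edit.suggestedPx); // in a real app, fired as the user drags
    }

    interpret("make the eyes a little smaller").then((edit) =>
      renderSlider(edit, (px) => console.log(`eye height set to ${px}px`))
    );

The LLM spares you the hunt through menus; the slider spares you from guessing pixel counts in words.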

Footnotes

  1. But let’s be honest. Was it really the first time?

  2. They do it so much that a new word was coined for it: “hallucinations”.

  3. That’s 256 levels each of red, green, and blue (256 × 256 × 256 = 16,777,216). An iPhone today uses the DCI-P3 color gamut.