I had recently predicted that:
we’ll have some version of actually impactful AI agents before the end of 2025 or if they do come out at the end, they will be better than I expected1
Operator is the next step in this journey and I’m having two distinct reactions to it. A disclaimer here: it has only rolled out to Pro users (the $200-a-month subscription) and I’m still just a lowly Plus subscriber, so I haven’t actually gotten to test it to see how it would respond to my hypotheticals.
So What…
Let’s start with the negative perspective. The demos are basically all variations on the same theme: using a supported app to buy a thing on the internet. But apps are already pretty well designed for buying things on the internet. I’m a long-time Instacart user and my workflow is something like:
add everything from the last order to the cart
remove a couple things that we have too much of
add a few things that are consistently in the bi-weekly rotation
browse briefly to make sure I’m not forgetting anything
do a bit of impulse shopping2
This takes probably 5-10 minutes, and a substantial portion of that is on the browsing side of things. With each order lasting close to two weeks, this is just not a substantial burden, and I’m not even sure how much faster it could be made, because some portion of that time is spent thinking about what I intend to cook over that period.
I get why they did the demos they did. They partnered with specific companies and agreed to feature them like this. Also, apps are a much more constrained space, so a pre-vetted and tested use case is far more likely to produce a happy-path demo success.
But I think the number of people this is solving a problem for (oh, I would have spent $500 on Lakers tickets to go to a game tonight, but I don’t have time to be bothered with going on StubHub) is, while not zero, perhaps shockingly close to it.
So my primary take-away is that this is a solution in search of a problem.
And there’s a reasonable chance they’re aware of this.
Here’s what would change my mind on this and what I intend to test when I get access: moderate to high success on web based research projects. I’ll give a couple of examples.
Let’s say I’m considering taking a new medication. Here’s a list of the questions I’d like answered (which frustratingly are somehow not already co-located in a single source — if you know of one, please let me know):
What’s the number needed to treat (NNT)?3
How long until I should expect efficacy from the intervention?
How large of an impact should I expect?
What’s the half-life of the medication in my body?
What are contraindications to stop taking the medication?
What likely happens if I stop taking the medication?
Do I need to titrate up (or down)?
What are the common side effects along with their rough prevalence rates?
What are the tradeoffs between this and other similar medications (what even are the candidate alternatives)?
Are there any interactions with [long list of supplements]?
Is there any supplement that fulfills a similar function that should be considered?
Should it be taken at a particular time of day?
Is this recommendation based on efficacy or to minimize side effects?
What’s the latest peer reviewed meta-analysis of this medication and family of interventions?
Has anything potentially changed in the literature that my doctor may not be familiar with?
This may seem excessive, but in my experience medications are recommended based on population-level effects, and as a single individual in that population all I really care about is how it’s likely to impact me. I don’t do this level of effort for every intervention, but I do for some, and some subset of it for pretty much anything new I’d consider taking.
A research project like this can be accomplished with something like 10-20 web searches (some specifically on Google Scholar), reading half a dozen to a couple dozen articles, synthesizing the information at each step, deciding on the next follow-up path, and quitting when feeling good enough. If an AI agent like Operator could do even 80% of the work here, it would be great. Doing a good job at something like this takes me at least an hour and sometimes several, depending on the complexity and stakes involved. I could also imagine making an initial template that can be improved over time, so that all of my medication analyses have the same standardized formats and tables, making it even easier to compare options and be reminded of past decisions.
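The search-read-synthesize-decide loop above is roughly the shape an agent would need to implement. Here’s a minimal sketch of that control flow; every function below is a hypothetical stub standing in for a real search API or model call, not any actual Operator interface:

```python
# Sketch of an iterative research loop: for each question, search,
# summarize what comes back, and stop when the findings feel sufficient.
# All functions are stubs standing in for real search/LLM calls.

QUESTIONS = [
    "What is the number needed to treat (NNT)?",
    "What are the common side effects and their prevalence?",
    "Are there interactions with my supplements?",
]

def web_search(query):
    # Stand-in for a real search API call; returns candidate articles.
    return [f"article about: {query}"]

def summarize(article):
    # Stand-in for a model call that reads and condenses an article.
    return f"summary of {article}"

def research(questions, max_rounds=20):
    notes = {}
    for q in questions:
        findings = []
        for _ in range(max_rounds):
            articles = web_search(q)
            findings.extend(summarize(a) for a in articles)
            # A real agent would judge "do I feel good enough about
            # this answer?" and pick a follow-up query if not.
            if findings:
                break
        notes[q] = findings
    return notes

report = research(QUESTIONS)
for question, findings in report.items():
    print(question, "->", len(findings), "finding(s)")
```

The standardized-template idea maps naturally onto the `notes` dict here: keep the question list fixed across medications and the output format stays comparable.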
And if Operator can do this, it can do lots of similar things:
competitive business analysis
feature parity assessment
market opportunity
ideal customer framing
finding hotels with onsite restaurants that accommodate specific dietary needs
providing a top list of local vendors according to some nuanced criteria
when is the best time of year to travel to X — not just the standard recommendation but based on my particular preferences and needs
Lots of opportunity both for personal use and business use. So I do see potential here, even if it wasn’t demoed. For me, use cases like this would fit into the category of meaningfully impactful in our lives — it could save me hours of time on tasks that are important to me but don’t bring joy in the doing of and could have massive impact on a very large number of business tasks. But, there is no guarantee4 that Operator is currently capable of any of this.
Ok, so why should I care anyway?
This section covers a few points on why this still matters even though the demos themselves were underwhelming and, as is, this provides limited value.
It’s still just January 2025
We’re less than a month into the new year. Even though I was predicting that 2025 would have a huge emphasis on agents, I’m still quite surprised to see this announcement so early. If this had launched in June, I probably would have thought: right on track, this version is pretty bad, and in about 6 months we’ll have something much better (confirmation bias of my previous prediction at maximum). So this either moves up the timeline to useful agents substantially (mid-2025 instead of year end), or it provides a lot more wiggle room and capacity for false starts and unanticipated difficulties before something crosses my arbitrary threshold by year end.
Agent(s)
When Altman introduced Operator, he said that OpenAI was launching its first agent, which implies plans for more than one. Operator may be limited in its utility because it was chosen as the thing that could ship now to continue momentum and build excitement. Other agents in development may have a different focus or set of capabilities that are much more useful, but need more work before revealing.
CUA (computer using agent)
This looks quite impressive. On one hand, it’s completely absurd for computers to communicate via screenshots of websites plus mouse and keyboard. It’s so deeply inefficient compared to API-based communication. But at the same time, it’s so generic and broad in its utility that it’s a great achievement. I’d file this under “things that are easy for humans but hard for AIs,” a category I wrote about recently and was less concerned about. I make an exception for things in this camp, though, because they are so clearly useful as glue. If this “just works,” it makes so many kinds of agent interaction with the digital world possible.
In order for it to achieve that feeling of “just working,” they’ll need to make substantial progress on the benchmarks they mentioned. But one of the takeaways of recent AI progress is that if you can benchmark it, the models can improve at it.
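The screenshot-in, actions-out pattern has a simple generic shape worth making concrete: capture the screen, ask a model for the next action, execute it, repeat. This is only an illustrative sketch of that loop; the stubs below are hypothetical and do not reflect OpenAI’s actual CUA implementation:

```python
# Generic shape of a computer-using agent loop. Every function here is
# a hypothetical stub, not OpenAI's implementation: the real system
# would capture pixels, run a vision model, and synthesize input events.

def take_screenshot():
    return "pixels"  # stand-in for a real screen capture

def model_next_action(screenshot, goal):
    # Stand-in for a vision model mapping (pixels, goal) -> an action
    # like {"type": "click", "x": 120, "y": 300} or {"type": "done"}.
    return {"type": "done"}

def execute(action):
    pass  # stand-in for sending mouse/keyboard events to the OS

def run_agent(goal, max_steps=50):
    # Loop until the model says it's finished or we hit a step budget.
    for step in range(max_steps):
        action = model_next_action(take_screenshot(), goal)
        if action["type"] == "done":
            return step
        execute(action)
    return max_steps

steps = run_agent("add last order to the Instacart cart")
```

The inefficiency complaint is visible right in the structure: one full screenshot and model round-trip per click, versus a single API call that could express the whole intent.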
Productization
One of my core hypotheses is that in many ways the models are already good enough (although better models certainly won’t hurt) and so much of the value to be unlocked is through building products that use the models well. Like, the marginal value of an extremely good recommender system in the abstract is quite limited to a typical person, but build these systems into just about everything on the internet and you have huge impact.
Agents and Operator are a step in this direction: something that solves a problem for a user end to end, without them needing to know the intermediate details or even being aware that they’re interacting with AI. The more productization work that gets done this year, the bigger the impact and rate of change for society as a whole.
Wrapping things up
Ultimately, I think I’m going to do a partial update to my “useful agents” timeline and say maybe September or October of this year, holding out the option to readjust substantially when I actually get around to trying Operator.
I don’t yet have opinions on DeepSeek5, but you might want to check it out. My limited understanding suggests it may have been a particularly good use of the synthetic data I alluded to in the frequently asked questions post regarding the data wall. Its release may also very well be related to the large selloffs in NVDA and TSM (my tentative guess is that this is an over-reaction, but worth continuing to follow).
Claude just launched Citations, which for the relevant domains seems absolutely critical.
If you want a video summary of all the latest news, consider checking out the latest AI Explained overview.
I feel like I’ve been of the “things will happen faster than we all expect” orientation, and it’s still sort of surprising and a bit disorienting when things like this arrive sooner than I thought they would. But given the assessment in the top half of this post, it’s a little hard to know exactly how much to update.
like trying the latest vegan cheese, which, while much better than it used to be, is still probably pretty bad. Unsolicited vegan cheese recommendations: Miyoko’s Mozzarella, not to be confused with the sort-of-hyped but very disappointing anti-recommendation of Miyoko’s pourable mozz. And Kite Hill’s ricotta can be turned into an excellent queso fresco substitute if you dry it out with some salt overnight.
One of the most important stats for an intervention in my view and yet somehow barely in the mainstream discourse
And realistically, it’s pretty unlikely: if you could demo this, why wouldn’t you?
TLDR: o1-quality model trained for a fraction of the cost, partially open-sourced(?), major result from a Chinese company