Right on the heels of being deeply underwhelmed by OpenAI’s Operator Agent, I’m having the exact opposite reaction to Deep Research. One of the benefits I’m finding of writing more of my thoughts down is that there is this clear record of what I was thinking and a trove of notes to refer to. In being critical of Operator, I was laying out what I would find useful and Deep Research is a great fit for the use cases I mentioned.
Here’s the first Deep Research prompt I tried (which should look quite familiar):
Let’s say I’m considering taking a new medication. Here’s a list of the questions I’d like answered:
What’s the number needed to treat (NNT)?
How long until I should expect efficacy from the intervention?
How large of an impact should I expect?
What’s the half-life of the medication in my body?
What are contraindications to stop taking the medication?
What likely happens if I stop taking the medication?
Do I need to titrate up (or down)?
What are the common side effects along with their rough prevalence rates?
What are the tradeoffs between this and other similar medications (what even are the candidate alternatives)?
Are there any interactions with [long list of supplements]?
Is there any supplement that fulfills a similar function that should be considered?
Should it be taken at a particular time of day?
Is this recommendation based on efficacy or to minimize side effects?
What’s the latest peer reviewed meta-analysis of this medication and family of interventions?
Has anything potentially changed in the literature that my doctor may not be familiar with?
The medication in question is <redacted>, please compile a report answering all of the above questions.
The results were excellent. One of my core differentiated strengths is doing research: thinking of the right questions to ask, sifting through the right sources, evaluating data that isn’t fully consistent, synthesizing related ideas and focusing on the most relevant details. The report that was produced was better than I could have done.
It was thorough, nuanced and useful. There is no chance that I could have been as consistent in answering many of the questions: I would have gotten bored, tired or distracted. My best guess is that I could produce something comparable in ~20 to ~40 hours of work. If I wasn’t writing for external consumption and was entirely focused on gaining the knowledge for myself rather than expressing that knowledge, it would reduce things considerably, but would still be a whole day task.
This should be shocking. Even though this is exactly where I thought we’d be, it’s still just so different to experience it. We are not ready for this. And imagine what’s next. Enterprise Deep Research that can reference all the proprietary data of your organization: google docs, confluence, code repositories, slack, email, etc. How can this do anything but fundamentally change our relationship to knowledge work1?
One of the reasons that I started this post off with a plea2 for you to try Deep Research is that it seems like the best way to understand where your very near future job status lies. If a large part of your job involves tasks sort of similar to what I described above, or could plausibly be done even at a modest level now by Deep Research, it is time to really think about how to become more resilient to the forces of change that are upon us.
I understand that not all results will be as compelling as my first attempt. There will be some natural variation in quality due to strengths and weaknesses of the underlying model as well as substantial variation in prompt quality (I, at least, think my prompt was pretty good…). And have we fully solved the hallucination problem? Of course not, but one of the things I say all the time about self-driving cars is that it’s the baseline comparison to humans that matters, not some hypothetical ideal. I believe what I read may have contained mistakes, but I think at no higher rate, and perhaps a lower rate, than I would have generated myself.
What’s this, a chess analogy tangent, well I never…
I’m a good, but not great chess player. I think Morphy would be pretty happy with my life choices here:
"The ability to play chess is the sign of a gentleman. The ability to play chess well is the sign of a wasted life." — often attributed to Paul Morphy
I could tell the difference between a chess engine that has a 2000 Elo strength vs a 2400 Elo strength, but I would have no capacity to tell the difference between a 2800 Elo engine and a 3800 Elo engine. And I’d probably be pretty strained to even tell the difference between 2400 and 2800 while we’re at it.
This ultimately ends up being a lot more about the discourse around GPT 4.5 than Deep Research specifically, but my point is that I no longer really trust the ability of a typical human to evaluate the quality of an AI model or system. With typical hubris, I think that I can still evaluate the quality, but give it a year and I think even I’ll be able to give that up.
I trust calculators and spell check and alarms and chess engines and many other things more than myself. And in the not too distant future, for many of us, I think we’ll add AI systems to this list.
Another Use Case
From my predictions article, I thought we’d have agents that could:
i want to have an online escape room experience for 8 people that is sci-fi themed and takes inspiration from Martha Wells work, include a welcome email that gets people excited about the idea and all of the code needed to make any virtual interactive experiences work
And we’re not quite there, but we’re really close. I gave Deep Research this prompt:
I want to create a time-travel themed virtual escape room experience for my friends. Please write a short background story that is sci-fi themed (lean funny and silly over anything serious). Then create a series of interactive puzzles (powered entirely by native javascript/html so that they can be easily shared) using this theme. Potentially consider the idea of having participants be in the same "room" but at different times and them needing to coordinate information across the different time periods to solve some of the problems.3
The failure case was unexpected for me. It didn’t write any code. However, I then gave all of its output to Claude 3.7 Sonnet (with extended thinking) and in one shot4 got a working demo. Here are some screenshots:
When I talk about the democratization of creativity, this is exactly what I mean. This is an incredibly rough first draft that I spent just a handful of minutes prompting and it’s already working! It does the thing I wanted it to. Is it great? No, the puzzles are quite bad, but the form factor is decent and it’s something5 (it works!?!) and if I wanted to iterate, I think it could become something pretty good.
So why do I think we’re close but not there yet? Well, I had to use two different agents/models, there was a bit of fiddliness, and the output is D/D+ level. But I think it’s an important contrast to show a use case that is so different; it highlights the unevenness of things well, but the slope is up and to the right.
Speculators going to speculate
It has seemed really clear to me for some time that we were headed this way, but it emotionally feels so different to experience it. This is already an AI agent that will meaningfully impact my life.
And as far as projecting into the future, the test time compute paradigm of the o1 series of models makes me think it’s even more likely that we’ll have some version of actually impactful AI agents before the end of 2025, or if they do come out at the end, they will be better than I expected. “Meaningfully impact our lives” is pretty vague, but I think we’ll know it when we see it.
It’s early 2025…rather than end of…
I get this is the most AI-hype forward that I’ve been. And of course, I might be wrong, but overall, I’ve seen enough to stop hedging in some areas.
I absolutely would have paid money for the medication report, and only being cheap and irrational would have kept me from paying a pretty decent amount for it. The time savings is a huge value. Now, it’s pretty unclear to me how often things like this will come up, but in my experience, once you know something is possible with AI and start using it for that, you see more and more opportunities.
All of this makes me double down, nay, triple down on the upcoming disruption to the labor market. My readjustment from this experience is that it’s all coming sooner, faster, better (in some ways) and obscenely disruptively. I’m going to try to spend some more time thinking concretely about6:
what, if anything, should a knowledge worker be doing to prepare?
how does the shape of knowledge work change in a world where producing research and content becomes so cheap, fast and efficient? where does the value add come from?
what are the short term changes and what are the longer term equilibriums that we might find in relation to knowledge work?
in the face of some disappointing UBI trials7, what should we be thinking about at a policy level?
If there is anything in this sort of bucket of things that you think I should add to this list, let me know, but the idea is to try to make the abstract more real, which is what this Deep Research experiment has done for me emotionally.
I found this Ezra Klein interview with Ben Buchanan, the top adviser on AI in the Biden White House, quite interesting if you’re looking for more things to listen to in this space.
Also, seriously go try out Deep Research. There’s a chance that free users might even get 1 or 2 to try out. Or send me your prompt and if I have any prompts left for the month, I’ll consider it…
I would tentatively guess that such a system could already write median level system design docs for many organizations.
If you try Deep Research and are disappointed and I know you IRL, I’ll venmo you the twenty bucks for a 1 month OpenAI Plus subscription.
I avoided the Martha Wells reference and went for something generic just in case that was going to cause copyright issues or refusals. Plus users are currently only allotted 10 Deep Research uses per month and this wasn’t the product surface area I wanted to explore. But go read the Murderbot Diaries (I’ve only read the first one, but it’s really good and it’s more of a failure of my ability to read books than a comment on their quality that I haven’t finished them yet).
I had to change a few file names that got downloaded from Claude with prefixes that didn’t match the expectations… super minor, but perhaps a deal breaker for someone who doesn’t know how to do a tiny bit of debugging around why it wasn’t working. For example, the page couldn’t access the CSS, so it just looked like some plain text, and none of the buttons worked (because it couldn’t access the underlying JS); you might just shrug and be disappointed.
Almost self-conscious about every italics choice in this article after reading Zvi on writing.
If you’re curious about what GPT 4.5 had to say about this, see this chat where I pretty quickly feed it this whole blog post and then ask some follow up questions. Access for plus users just dropped, so I haven’t had much time to explore here, but I’m looking forward to forming some opinions. Super early, though, I feel like I’m on the side of there being something here that, if anything, is likely underrated.
Not definitive, but I’m less optimistic than I once was about this as a path forward here. Unconditional cash transfers in countries with substantial purchasing power disparity such as through GiveDirectly are still highly effective, but the data I’ve seen on UBI experiments in the US have been less compelling.