Reflecting and Projecting about AI
a snapshot of thoughts about AI (it's a moving target, but let's try not to move the goalposts)
Hi Folks,
Agenda:
High Level Assessment / Predictions
Incentives
Mental Models
o3
Biggest Misses
I'm sure I'll have made some mistakes in this; I prioritized getting it out at all over refining and iterating, but hopefully the high-level ideas are still of some value.
High Level Assessment / Predictions
It's probably imprudent to over-anchor on a single data point, but I'm just a human trying to make sense of the world, and I overvalue my emotional responses to things just like everyone else. I'll talk more about o3 below, but the main takeaway from all this is that I'm even more convinced of extremely disruptive change than I was 9 months ago. And, if anything, my timelines to that are accelerated (despite apparent limitations in the unreasonable effectiveness of pre-training).
Here is a look back at a few things I wrote earlier in the year, with some thoughts on how they held up.
AI Agents
In May 2024, I wrote:
All things considered, chatbot interfaces are quite limited, it takes a lot of interaction and effort to get what you want. What will really be a game changer is when AI Agents are working well. Rather than interacting in real time with a model, instead, we will be able to provide high level instructions and then the AI will go off and iterate for hours or days and come back with results. This is currently limited due to the following math:
Say a model produces appropriate results 99% of the time. This seems good. And for real-time interaction it is great: we can notice the 1% of the time it doesn't work, ignore it, and try again. But for autonomous projects, where a model needs to be correct 100 times in a row, success would drop to ~36%, and if it needed to be correct 1,000 times in a row it would drop to roughly 0.004% of the time (0.99^1000 ≈ 0.0000432). Which is not great.
So what we need to enable AI agents is more "nines" of reliability: 99.9%, or realistically for many projects something like 99.999%. This will be a tremendous game changer and enable things like:
do my taxes (where you'll occasionally get emails requesting new documents to be uploaded)
I want to have an online escape room experience for 8 people that is sci-fi themed and takes inspiration from Martha Wells's work; include a welcome email that gets people excited about the idea and all of the code needed to make any virtual interactive experiences work
please plan a week's worth of meals for my family taking into account our existing dietary needs, but we're all trying to shift our macro focus to include more protein than before; include a detailed shopping list (ignoring staples that we frequently have) and an output of the macros
my child is excelling at math, but struggling with reading; please look at these recent homework examples and come up with a set of exercises that will help them with reading -- use a theme around trains
I found these three podcasts to be excellent and haven't had time to listen to these 11; can you provide a summary of the remaining 11 and let me know whether any of them are worth listening to? Then provide recommendations for what I should listen to next.
That won't happen overnight and the results may be bad at first. But I think all of this may be possible sooner rather than later and will be the next big thing that continues the hype here. Very rough guesses (to be taken with heaps of salt): we'll have crude versions of this by the end of the year, and good-enough-to-meaningfully-impact-our-lives versions by the end of next year. (There are some early efforts at this in the software space -- Devin AI, GitHub Copilot Workspace -- but I don't think we're there yet; very much still in the "human in the loop, this is a useful tool" phase rather than a replacement.) And sort of related, this looks dystopian and awful... but maybe future versions won't be?
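A quick aside on the compounding-error arithmetic in that quote: here is a minimal Python sketch of the same calculation (only the 99% / 99.9% / 99.999% reliability figures come from the original email; everything else is just illustration).

```python
# Minimal sketch of the compounding-error math quoted above: if each step of an
# autonomous task succeeds with some probability, end-to-end success decays
# exponentially with the number of steps -- hence the need for more "nines".
def chained_success(per_step_reliability: float, steps: int) -> float:
    """Probability that all `steps` sequential actions succeed."""
    return per_step_reliability ** steps

for reliability in (0.99, 0.999, 0.99999):
    for steps in (100, 1000):
        print(f"{reliability} reliable over {steps} steps -> "
              f"{chained_success(reliability, steps):.4%} end-to-end success")
```

At 99% per step, 100 steps lands around 36% and 1,000 steps is effectively zero; at five nines, even 1,000-step tasks stay around 99%.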
I think this was basically correct. If you follow the AI news at all, AI agents are getting all the hype. And they are still (at least as far as I can tell) pretty bad. Devin AI is available for $500 a month; if anyone uses it and has opinions, it would be great to hear them. But I think it's pretty much a “crude version by the end of the year.”
And as far as projecting into the future, the test time compute paradigm of the o1 series of models makes me think it's even more likely that we'll have some version of actually impactful AI agents before the end of 2025, or, if they do only arrive at the very end, they will be better than I expected. “Meaningfully impact our lives” is pretty vague, but I think we'll know it when we see it.
One of the reasons agents matter so much is that they are so much more scalable. If you no longer need a human in the loop to accomplish meaningful tasks, then you are able to move from AI as accelerant and tool to AI as replacement and scale. Even if the impact isn't that large on the consumer side of things, the impact in business will be huge if these are ever made to work in a viable fashion.
GPT-5
Also in May, I wrote:
Part of what's contributed to my views so strongly has been interacting with new model upgrades and comparing them to previous iterations. My coding experience with Gemini 1.5 is really something; I'm now surprised and actively confused when it's wrong. For the legal project, it was shocking how much things changed when switching from Gemini 1.0 to 1.5. Basically it went from "well, we're on the right track, I think there's something here" to "oh, this is it, we've probably done it, let's work on things other than quality."
The next round of model upgrades could be a really big deal. Many people are still primarily interacting with models that finished training in August 2022 (GPT-4, for example). That's getting close to two whole years ago, which is forever in this space. I think agents will be a much bigger advance than this, but these model upgrades will likely be part of what enables agents. And even without agents, I still think these updates will be really impactful.
I'll talk more about this in the limitations of pre-training section, but I think this one is mixed. In a way, if you look at Claude 3.5 Sonnet, Gemini 2.0, and o1 and o3, then yes, the next model releases were indeed really impactful. However, what I (and, I think, some others) misunderstood was that the gains would continue to come from just scaling up pre-training. I still want to both see actual benchmarks and interact with a GPT-5-scale model before being fully convinced of this miss, because I think there's a pretty big difference between "not as much better as we expected" and "not much better at all."
Somewhat in my defense, I did write this in March:
AI progress is happening much faster than I would have expected. Scaling by just throwing more compute at the problem has been surprisingly effective. There was no good a priori reason to think this would work as well as it does. Order-of-magnitude improvements are possible every 2 years (not guaranteed, but possible). And in addition to "just" scaling being very effective, a lot of highly skilled people are working on algorithmic improvements as well.
Actual Change in work and the economy
In March I wrote:
This should be transformative to multiple industries. But here is where things very well might be a lot slower to change, because the barriers are cultural and process-related in nature. The tech already justifies (and will shortly even more strongly justify) tremendous changes in work, but it won't happen quickly.
I think I was very right about this. And if anything, I continue to see this as the primary bottleneck to change. I still know of software engineers who do not use AI. This may be due to this (from March):
Hot take: Engineers that aren't seeing significant improvements with the use of AI coding tools are just doing it wrong.
And that I still believe, even more so than before.
So, I think the main protective barrier we continue to have from overly rapid change and disruption is simply our deeply risk-averse incentive structures and the glacial pace of organizational change.
2025
Just a few bullet points here for now:
Heavy focus on agents; for at least some use cases they are effective
Even for effective agentic use cases, they may not be cost-effective yet, but that seems likely to be a not-that-hard 2026 problem
Less emphasis on pre-training, more emphasis on test time compute
More effort and results in other algorithmic improvements
It's probably not entirely coincidental that test time compute rose in prominence just as pre-training gains were declining
I am also not fully convinced that there aren't still gains to be had in pre-training; I do not think a narrative that it is a dead end is either useful or likely to be true
The first industry that will be hugely disrupted (and it will happen in 2025) is customer support agents in call centers
At least one firm will do major layoffs in this area and replace/supplement workers with AI systems
Customer satisfaction will go up not down
This will become mainstream enough that normal people will become increasingly concerned about whether it will impact their industry and jobs
I don’t think 2025 will be peak AI anxiety, I probably expect that more in 2026 or 2027
AI products will begin rapidly evolving
People will still think primarily of chat interfaces as AI, but this will begin shifting in 2025 and by 2026 AI will be pervasive enough that the predominant way of using AI will not be through chat / co-pilots
It won’t be until 2027+ that it shrinks fully into the background like ML is today and is just taken for granted in most software
People begin taking the national security risks of AI more seriously, but not enough to meaningfully do anything
There is still a window to act in 2026, but we really really need to take it then
At least one (but certainly not the majority) of the major tech companies dramatically changes its hiring methodology in light of AI
For example, allowing the use of models during an interview, but having higher expectations of what can be accomplished in 40 minutes
Just generally moving away from leetcode style questions
As a potential alternative, with AI assistance it should be much easier for interviewers to make fictional code bases with bugs they'd like a candidate to refactor
This will still not be the norm; I'll still complain about how tech hiring is more non-representative of the actual job than ever
Incentives
This might not warrant a top-level item, but I think it's good to at least gesture at a thing that clouds most current discourse on AI and the future.
The people best positioned to really understand the current state of things and the near future almost all have a vested interest in AI being useful and transformative: they work for and built AI companies, and their financial outcomes depend on AI not being hype.
In contrast, I believe there is a sort of game-theoretic value in being an AI skeptic. If AI transforms everything, then no one really cares that you were wrong about this thing, but if AI doesn't, then your status rises in a post-bubble world.
I'm not so cynical as to think that this drives all underlying analysis or belief, but even if it's subconscious, it can easily shade things. To be explicit about my motivations: I want to be right about things, AND if I do happen to be right and I help even one person be more prepared for the future, then great.
Mental Models
There are a few things I want to cover in this section: the limitations of pre-training, caring too much about what the models can't do, and using the models for the wrong things.
Limitations of pre-training
If you've been following the online discourse over the last few months, you may be under the impression that AI is “hitting a wall” or that progress is slowing down dramatically (a framing in which noted AI/LLM skeptic Gary Marcus declares victory). There is some truth here, but also some almost willful misinterpretation.
My summary is that major labs began to see evidence that the difference between GPT-5-scale models and GPT-4-scale models was notably less impressive in terms of emergent capabilities than the change from GPT-3.5 to GPT-4. This is an important finding. Part of what has led to so much of the enthusiasm and expectation is that these “scaling laws” have held so consistently: more data, more parameters, more compute ==> better model. The unreasonable effectiveness of this recipe has been the primary driver of AI progress over the last couple of years.
But less impressive than expected still does not equate to no gains. And the part that feels like willful misinterpretation from people who should know better is conflating the pre-training component with all possible AI progress. Many highly motivated and skilled people are heavily incentivized to find clever and novel solutions to the many problems in this space. Should we adjust our thinking in response to this finding? Yes. Should we massively over-correct and expect no further improvements? No.
I think this is some evidence that we are unlikely to be living in Leopold Aschenbrenner's projected future. Even if that overall picture is now less likely, I still think the ideas in there merit consideration and thought (it's a bit of a project, though).
What the Models Can’t Do
As a thought experiment, consider a company that employs a software engineer who can, with considerable time and care, improve the efficiency of the company's core data processing, saving it potentially millions of dollars a year. This same software engineer is unable to change the color of a button on the UI that customers see. As someone with a vested interest in the success of the company, how should I evaluate this engineer? Should I focus on this seemingly weird shortcoming, or should I be absolutely stoked that this engineer can save the company millions of dollars?
I think this is a strange mistake that a lot of people make when thinking about the capabilities of AI models. Rather than being interested and curious about how to get the most out of the things the models are good at, they seem weirdly focused on the things the models don't do well. Spending time and effort to get that software engineer to be able to adjust trivial UI elements is such a waste; just let that software engineer contribute where they are most useful.
Of course, some of this discourse is around AGI (artificial GENERAL intelligence), and the need to be able to do things that are easy for humans is baked into a lot of people's notions of “general” in this case. I don't know if I have a strong opinion here; I'll need to think more about this. But overall, I think this line of thinking distracts from the potential utility and upcoming changes. If an AI model or system can do novel mathematics research, then let's use it for that and not worry that it can't do things a five-year-old can do. Because the five-year-old can't do novel mathematics research. Maybe comparative advantage for humans will end up being a good thing.
This line of thinking about what the models can't do is pervasive. I find it very confusing. My best guess is that it's rooted in emotionally defensive goal-post moving. I do think that people who are into chess are slightly better prepared for the current moment, because we've had over two decades to get used to computers being crushingly better at something we care about (all without it taking away any of the joy of playing; this is an important subpoint).
Using the Models for the Wrong Things
One of the reasons we've seen less economic impact from AI so far is that, I believe, there is a widespread misunderstanding of how to “productize” the models. I'll give two concrete examples of this kind of thinking, but hopefully they are illustrative of a whole class of misguided thinking here (this is related to some degree to the above discussion of what the models can't do, but it's also fundamentally different in some ways).
Natural Language Interfaces
In software, we've developed many effective techniques to get user input: forms, menus, wizards, drop-downs, autofill, autocomplete, etc. These form the backbone of user interfaces for all kinds of software applications. And you know what? They're pretty good. They are not the problem. And yet.
Tech folks seem oddly insistent on creating natural language interfaces for things that don't need them. Let's give people access to a prompt and do all kinds of gymnastics to turn that into the actual data we need. I mean, if the models were so far advanced that this was trivial, then I guess, sure. But there are so many failure modes here, and I just don't know who this is for. Technical users of software will not want this, because it will be slower and more error-prone than traditional interfaces. Maybe there is some utility for novice and less technical users, but in these cases I think you're harming a novice user's ability to progress to being an intermediate user, and in many situations just making things worse, because the novice user will have more difficulty understanding why things failed (and the range of trial-and-error inputs expands so dramatically that this novice strategy is rendered almost completely useless).
You get a co-pilot, you get a co-pilot, you get a co-pilot!
This one feels much easier for me to understand. ChatGPT was many people's first exposure to LLMs, and it has dramatically shaped the public consciousness of what AI is. Creating a mediocre co-pilot experience for X (not Twitter, this is just a variable) is a thing you can do to make it feel like you're “doing AI”. But it's so easy for this to be no better than letting a user copy and paste some things into the model of their choice, and it's also pretty easy for it to be actively worse.
A co-pilot that is supposed to make things easier but doesn't quickly becomes an extremely frustrating customer experience. The bar to release a co-pilot should be much higher than it currently is.
Going Forward
One of my beliefs is that even if literally all model progress stopped today, we would continue to see a lot of change and innovation over the coming decade. That belief is rooted in the expectation that we will collectively get much better at understanding how to use the models effectively and build truly useful products in the coming years. This takes time. From my own experience, I often need to build a bad version of something once or twice before I really understand how to build the good version (this is part of why I pragmatically want to ship early, bad versions and iterate much sooner than most people are comfortable with). And the entire industry is going through these growing pains all at once.
What should the Models be used for?
This is an incredibly reasonable question; I do have some initial thoughts here and hope to write more about this in the future.
o3
The o1 series of models, of which o3 is the latest announced (and not yet released), is important for a few reasons.
The effectiveness of test time compute
This is essentially giving the model more time to “think” at inference time. Rather than spitting out the first answer, the model potentially gets to plan and execute multiple queries. It was hoped that this approach would help with “reasoning”-type tasks, and I think the evidence is quite compelling that it did. This paradigm matters because it is another lever that can be pulled instead of just scaling pre-training. Another reason it matters is that improvements here can potentially come faster than pre-training improvements. For example, o1 was released quite recently, and just a few months later o3 was announced (o2 was skipped as a model name for legal reasons, I think, so it's actually just one model iteration, but 2-3 months is quite a bit different from 6-12 months, although it's unclear how far forward to project that trend).
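To make the "spend more compute at inference time" idea concrete, here is a toy sketch of one well-known flavor of it: best-of-N sampling with a scorer. To be clear, this is not how o1 or o3 work internally (OpenAI has not published that); generate_candidate and score_candidate are hypothetical stand-ins for a model call and a verifier.

```python
# Toy illustration of test time compute: instead of returning the model's first
# answer, spend n inference calls and keep the candidate a scorer likes best.
# NOT how o1/o3 actually work; generate_candidate and score_candidate are
# hypothetical stand-ins for a model call and a verifier / reward model.
from typing import Callable

def best_of_n(prompt: str,
              generate_candidate: Callable[[str], str],
              score_candidate: Callable[[str, str], float],
              n: int = 8) -> str:
    """Trade extra inference-time compute (n samples) for a better answer."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score_candidate(prompt, answer))

if __name__ == "__main__":
    import random
    # Dummy stand-ins: a "model" that guesses and a "verifier" that prefers
    # guesses closer to 4. More samples -> better best guess, on average.
    generate = lambda _prompt: str(random.randint(0, 10))
    score = lambda _prompt, answer: -abs(int(answer) - 4)
    print(best_of_n("What is 2 + 2?", generate, score, n=16))
```

The point is just the shape of the lever: more samples, longer chains of thought, or search at inference time, rather than more pre-training.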
The Benchmarks
The performance of o3 on a couple of particular benchmarks is pretty amazing (there are caveats and nuanced discussion worth having here, but at the end of the day, this is the kind of thing that causes me to have an emotional reaction and have trouble sleeping for a few days).
ARC-AGI
This is the benchmark that has gotten the most attention and hype, and while it is impressive and important, I think it is actually a substantially less important benchmark result than the second one. But let's talk about it first. This benchmark has been a canonical example of things that are relatively easy for humans but hard for the models. Here's the write-up of the o3 results for more details. But a few highlights:
GPT-4o scored ~5%
o3 scored between 76% and 88%, depending on how much compute was spent
a Mechanical Turk worker scores around 75%, and a STEM grad around 95%
So, if you spend enough money (which is quite a lot), you can get near-human-level performance on a task that has traditionally been almost intractable for LLMs.
Why do I care a bit less about this benchmark than others seem to? I'm less convinced that it indicates what it's being purported to indicate: namely, that the LLM is able to “reason” its way through a novel task. The hope is that such novel tasks will be representative of real-world use cases, like booking a reservation at a vegetarian restaurant where the model has to understand the uniquely bad software that was written specifically for that restaurant. I do think there will be some transfer to real tasks like this, and I do think improvement here is meaningful, but the reaction to this single data point in particular is a bit overblown.
FrontierMath
This is the benchmark that matters more to me, because it's in the category of hard for humans / hard for AI, and I continue to see the most value from AI in this bucket. Yes, easy for humans but hard for AI might unlock some agentic use cases and provide some meaningful glue in these systems. But we're also unlikely to get all the potential innovation value out of AI unless these systems also make substantial progress on problems that are hard for humans.
The FrontierMath benchmark is indeed hard for humans.
FrontierMath problems typically demand hours or even days for specialist mathematicians to solve.
And it is hard for AI. The previous state-of-the-art result was 2%, which, given how hard these questions are, is still actually kind of impressive to me.
But the o3 model scored 24%. While that is far from benchmark saturation, it is an incredible accomplishment. I have a strong technical background, and it would take me literally years of effort (that I don't have in me) to accomplish something similar (if it were a full-time endeavor, I think I could learn sufficient background for a question and solve one in ~3-12 months; wide error bars here).
The reason I think this benchmark matters so much is that it suggests these models are potentially on the verge of being able to contribute to novel research (or maybe already can), aside from the other productivity-enhancing features of AI, just from a raw response to a series of questions or ideas.
This is an amazing tipping point (which, to be clear, we may not have hit, but we might be so, so close). I did not think this would happen so quickly.
“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…” —Terence Tao, Fields Medal (2006)
For me, this benchmark will be the most important data point I will be looking at for either o4 or whatever Google and Anthropic release next. If we see rapid improvement here, I think that is a reason to accelerate timelines to massive disruption.
Biggest Misses
This one is really easy for me. Sora is just not that good yet. I know some of this is me not being fully willing to invest in getting better at prompting. But it's just so far behind music generation (Udio, Suno, AI Sloth Army) in terms of being an outlet for creative expression. Maybe we'll still get there one day. But let's just say that I'm not feeling pressured to start on a screenplay just yet.
Overall, this makes sense. Video is a hard and complicated medium. I believe that film and television are both the best and most important artistic mediums of our time. Updated timeline for me being able to make this a hobby: 3-5 years (previously I thought I might be able to dedicate non-trivial time to exploring this in 2025).
I think it's a question of when, not if, there will be a widely popular AI-generated song. There's just too much latent creativity in the world that is enabled by these tools. I'm much less confident of that ever being the case for film or TV, but maybe this view is just too grounded in the present.
But maybe narrower areas will be solved, like these Veo 2 (Google's video model) generations of influencers.
Happy New Year!
Hope 2025 treats everyone well :)
This effort started as writing emails to a small group of people every few months, and I'm finally getting around to setting up Substack (please unsubscribe as needed :) ). Rather than publishing those emails, I'm just quoting from them as useful.