Reproducibility is hard
Cosmos Institute Grant Update: Part 2
Part 2 of an N-part series. For part 1 with more background context, see this announcement post. TLDR of part 1: I received a grant from the Cosmos Institute to investigate using formal theories of coherence to reduce hallucinations in LLMs.
I’ve spent the first third of my time attempting to reproduce a handful of existing AI benchmarks in the hallucination space: SimpleQA, HaluEval, FaithBench, and TruthfulQA. The general mission of these benchmarks is well summarized by the OpenAI announcement of SimpleQA:
An open problem in artificial intelligence is how to train models that produce responses that are factually correct. Current language models sometimes produce false outputs or answers unsubstantiated by evidence, a problem known as “hallucinations”. Language models that generate more accurate responses with fewer hallucinations are more trustworthy and can be used in a broader range of applications. To measure the factuality of language models, we are open-sourcing a new benchmark called SimpleQA.
My general intent was to be rigorous here. If I could reproduce a published benchmark¹, then any change in results from the coherence modifications to that same benchmark could be attributed with high confidence to the novel approach rather than to other factors. This sounded good in theory (and I think actually is good in theory). However, the term “reproducibility crisis” exists for a reason.
After a moderate amount of effort, I succeeded in replicating all results for SimpleQA; for HaluEval I came close to parity on some but not all of the evaluation metrics. FaithBench and TruthfulQA were not particularly close. I could have invested more time and energy here. I could have contacted the original authors. But it wasn’t absolutely critical. I mention this as context: I’m not claiming that higher reproducibility couldn’t be achieved, just that it wasn’t easy and the return on investment was rapidly diminishing, especially since replication was not the core focus of the grant.
The goal of this post is to outline some of the challenges I ran into with special emphasis on reproducibility of AI research and then to provide some suggestions for improvements for researchers working in this space.
Model Accessibility
Papers of course indicate which models they tested, but frequently do so in an underspecified way. A common model used in the research is gpt-3.5-turbo (because these ancient papers are from a year or two ago). But which model is this exactly? OpenAI still hosts these specific legacy model snapshots:
gpt-3.5-turbo-0125
gpt-3.5-turbo-1106
gpt-3.5-turbo-0613
Along with a `gpt-3.5-turbo` model. But which model is actually being served at that endpoint now? Which model was being served when the researchers did their work? If they didn’t specify a snapshot version, could it have changed between runs of their analysis? And even if the researchers had been more precise and specified an exact snapshot, would that snapshot even be available now? When will OpenAI² stop providing these models entirely?
And on top of all of this, it’s likely there are backend configuration parameters that could change over time. So even if you call `gpt-3.5-turbo-0125` today, it might not behave exactly the same as calling it last year or two years ago.
Here’s a concrete example. The wiki_bio_gpt3_hallucination dataset and benchmark contain sentences like this one:
John Russell Reynolds (1820–1876) was an English lawyer, judge, and author.
gpt-3.5-turbo-1106 considers this factually supported, while gpt-3.5-turbo-0125 correctly susses out the inaccuracy (Reynolds was a physician and not a lawyer).
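To make the snapshot problem concrete in code, here is a minimal sketch of comparing two pinned snapshots on the same claim, assuming the official `openai` Python client (v1+); the prompt wording is illustrative, not the benchmark’s actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SENTENCE = ("John Russell Reynolds (1820–1876) was an English lawyer, "
            "judge, and author.")

# Illustrative prompt only; the real benchmark uses its own wording.
PROMPT = (
    "Is the following sentence factually accurate? "
    "Answer with exactly 'supported' or 'not supported'.\n\n" + SENTENCE
)

# Pin exact snapshots rather than the floating `gpt-3.5-turbo` alias.
for model in ("gpt-3.5-turbo-1106", "gpt-3.5-turbo-0125"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # best effort at determinism
    )
    print(model, "->", response.choices[0].message.content.strip())
```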
Why does this matter?
In some cases small differences can really compound. In other cases, they just introduce a lot of noise and uncertainty. And at the very least it makes full reproducibility almost impossible. A small but useful area of research could be to quantify how large these differences can be, as a way of helping researchers who confront this problem judge whether snapshot drift could plausibly explain the variation they are observing.
Fine-Tuned LLM as a Judge
LLM as a Judge is a common evaluation technique. While we might prefer more traditional metrics, those metrics often cannot capture the nuance and detail that an LLM can. To get the desired results, researchers sometimes fine-tune an LLM specifically for this purpose and their use case. This makes sense and isn’t inherently problematic, but these fine-tuned models are not easily accessible, and even if the researchers clearly document how they did the fine-tuning, you’d have to replicate that fine-tuning perfectly in order to replicate the results of the benchmark.
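For reference, here is a minimal sketch of the basic (non-fine-tuned) judge pattern, assuming the `openai` client; the grading prompt and the `gpt-4o-mini` judge model are illustrative stand-ins, not any specific benchmark’s setup.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str,
          model: str = "gpt-4o-mini") -> str:
    """Ask a judge model whether a candidate answer matches the reference.

    Illustrative only; papers in this space often fine-tune a model
    specifically for this grading step, which is what makes their
    results hard to reproduce.
    """
    prompt = (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```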
Random Seeds
One project used random number generation for a critical component but failed to specify a seed in the code. This means that their own internal runs would not have been reproducible, much less a replication attempt by someone else.
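A minimal sketch of what seeding everything up front can look like in Python, assuming NumPy and (optionally) PyTorch are the relevant sources of randomness in the project:

```python
import random

import numpy as np

SEED = 42  # arbitrary; the point is that it is fixed and published

def set_all_seeds(seed: int = SEED) -> None:
    """Seed every RNG the project touches so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # only relevant if the project uses PyTorch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_all_seeds()
```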
Versioning
Not all projects specify the versions of the code and libraries they used. There can be bugs, or at the very least differences in output, between versions. This ranges from including no dependency or version information at all in the code repository to listing dependencies without pinning exact versions.
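A low-effort mitigation, sketched here as a hypothetical snippet rather than anything these projects actually did, is to record the interpreter and key package versions alongside the results:

```python
import json
import platform
import sys
from importlib import metadata

# The packages listed here are illustrative; record whatever your project imports.
PACKAGES = ["openai", "numpy", "datasets"]

environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {pkg: metadata.version(pkg) for pkg in PACKAGES},
}

# Write the snapshot next to the benchmark output so it travels with the results.
with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```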
Porting
And I made things more difficult for myself by trying to consolidate the various benchmarks into one consistent framework. The appeal of this is high for functionality and convenience, but it undoubtedly introduced additional reproducibility risk simply through added complexity, implementation differences, and inevitable bugs. A lesson in overconfidence that will not be the last for me.
Temperature
This one wasn’t problematic for any of the benchmarks I tried to replicate, but it will be an issue going forward. It is best practice in benchmarks to use temperature 0, which leads to nearly deterministic results from the model. However, most current reasoning models don’t support configurable temperature and always use a value of 1, which makes the output substantially more varied. But reasoning models are of course of great interest, so it’s critical that we find other approaches to improving reproducibility when using models that don’t support temperature 0.
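One workaround is to run the benchmark several times and report the spread rather than a single number. Here is a sketch, where `run_once` stands in for a hypothetical callable that executes one full pass of your evaluation and returns a score:

```python
import statistics
from typing import Callable

def summarize_runs(run_once: Callable[[], float], n_runs: int = 5) -> dict:
    """Run an evaluation several times and report the spread, not one number."""
    scores = [run_once() for _ in range(n_runs)]
    return {
        "runs": scores,  # publish the raw scores so others can compute CIs
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),
    }
```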
Suggestions
Reproducibility continues to be a challenging problem. There are just so many things that a researcher can take for granted and also so much implicit context in individual human brains. But here are some ideas for directional improvement:
If using the API of a model provider and it has a dated snapshot available, always use and specify the exact snapshot.
For reasoning models or other models that don’t allow the best practice of setting temperature to 0, consider doing multiple runs and publishing the result as the range across those runs. Better still, publish all the runs in an appendix with enough data that others can compute confidence intervals.
For code in your control, always set random seeds. Of course, not all sources of randomness will be within your control. When that’s the case, the best course is probably to document all the places where nondeterminism leaks in unavoidably, so that future researchers know to look out for it.
Consider containerization. I think there is a general gap in familiarity with tools like Docker between software engineers and AI/ML researchers. The value of publishing a Docker image is that it includes the exact artifacts you used to produce the result. Maybe you didn’t specify an exact version of some library, but in the published image, one and only one version of that library will exist. Before AI coding assistants, this may have felt like an onerous requirement for researchers unfamiliar with the tools. But Claude Code will likely one-shot a Dockerfile for your project, and you can ask all the follow-up questions you need.
All of these suggestions require more work. Time and effort are scarce resources and we all must make tradeoffs, but reproducibility is important. I wonder if a reproducibility checklist for peer reviewers would be helpful: require, say, at least 8/10 in order to be published?
Going forward
Despite not replicating all the benchmarks I had hoped to, I’m glad I spent the time here, and there are two major upsides to this investigation.
First is simply building an appreciation for the difficulty of the problem. I’ve never done any replication work before, and it’s useful for my mental model to understand the challenges better. Sometimes people talk about how AI will help with the replication of studies, and I think it will, but when AI can fully and reliably reproduce research without assistance, I will better understand the scope of the accomplishment.
Second is learning a lot more about the details of these benchmarks. A lot of my initial ideas about how to incorporate coherence into these approaches did not map cleanly. However, the work here did lead me to SelfCheckGPT, which contains a lot of ideas that are a better fit for coherence. Part 3 will be a deeper dive into the various formal coherence theories and whether they’re actually useful for reducing hallucinations in LLMs.
¹ I also wanted to reproduce the benchmarks in a single framework that would add consistent features like sampling, caching LLM responses, checkpointing, and formatted output.
² Here’s Anthropic on this topic:
Unfortunately, retiring past models is currently necessary for making new models available and advancing the frontier, because the cost and complexity to keep models available publicly for inference scales roughly linearly with the number of models we serve. Although we aren’t currently able to avoid deprecating and retiring models altogether, we aim to mitigate the downsides of doing so.
As an initial step in this direction, we are committing to preserving the weights of all publicly released models, and all models that are deployed for significant internal use moving forward for, at minimum, the lifetime of Anthropic as a company. In doing so, we’re ensuring that we aren’t irreversibly closing any doors, and that we have the ability to make past models available again in the future. This is a small and low-cost first step, but we believe it’s helpful to begin making such commitments publicly even so.
While this theoretically provides an avenue for mitigating some of the issues I mention above, in practice the result will be much the same.


