
Training Data Contamination in LLMs: A Skeptical Guide to SWE-Bench, LiveCodeBench and MLE-Bench

Written by Aerin Kim

LLM coding benchmarks like SWE-bench, LiveCodeBench, and MLE-bench claim to measure real coding ability. But training data contamination, conflicts of interest, and structural flaws undermine every leaderboard. Here's what's actually going on.

When AI companies say their model "solves 70% of real coding tasks," that number comes from a test. But what if the AI already saw the test answers during training? That's like a student who memorized the answer key.

This post is about why the old tests broke, why the new tests are better but not as good as their creators claim, and why you should be skeptical of any number someone puts on a leaderboard.


The Test Everyone Already Memorized

In 2021, OpenAI released Codex — a model fine-tuned for code — alongside a benchmark called HumanEval [1]. HumanEval was just 164 Python problems. Write a function that does X, here's a docstring, here are tests. Codex solved about 29% of them on its first try. That was impressive at the time.
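To make the format concrete, here is a made-up problem in the HumanEval style (hypothetical, not an actual benchmark instance): the model sees only the signature and docstring and must produce a body that passes unit tests it never sees.

```python
def count_vowel_words(words: list[str]) -> int:
    """Return how many words start with a vowel."""
    # The model's completion begins here; grading is purely functional.
    return sum(1 for w in words if w and w[0].lower() in "aeiou")

# Hidden tests the completion must pass:
assert count_vowel_words(["apple", "sky", "ice"]) == 2
assert count_vowel_words([]) == 0
```

Each problem is graded pass/fail against its tests, which is what makes the benchmark cheap to run — and cheap to memorize.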

The authors were careful. They wrote the problems themselves specifically so the solutions wouldn't exist on the internet. They even checked that the problems didn't appear in their training data. That was a smart precaution in 2021.

It stopped working almost immediately.

The moment HumanEval was published, its 164 problems became public. People posted solutions on GitHub. Tutorials walked through them. Research papers included them in appendices. Blog posts dissected them. Open-source training datasets scraped them. Within a year or two, HumanEval solutions were everywhere on the internet — which means they were everywhere in the next generation of training data.

Watch what happened to the scores. Codex in 2021: 29%. GPT-4 in early 2023: around 67%. By late 2023, models were clearing 85%. By 2024, some models were above 95%. That's an incredible trajectory. But here's the question nobody can definitively answer: did models get that much better at coding, or did they just memorize a 164-question test?
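Those percentages are pass@k scores (mostly pass@1, "first try"). The Codex paper [1] reports them with an unbiased closed-form estimator; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper [1]: given n sampled
    solutions of which c pass the tests, the probability that at least one
    of k randomly drawn samples passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3, the naive pass rate at k=1
```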

The same story applies to MBPP [2] — 974 crowd-sourced Python problems that followed a similar arc from research benchmark to internet wallpaper.

Think about it this way. If I gave you a math test, and then next year I gave you the exact same test after the answers had been posted on Reddit for twelve months — and your score jumped from 29% to 95% — would you conclude you'd gotten dramatically better at math? Or would you wonder if maybe you'd just seen the answers?

This is the fundamental problem, and it's not theoretical. Oren et al. demonstrated at ICLR 2024 that you can detect contamination even in black-box models where you can't inspect the training data [3]. Their trick was clever: if a model is suspiciously confident about the exact wording of a test problem — not just the concept, but the specific phrasing — that's a tell. Like catching a student who not only gets the right answer but writes it in the exact same unusual notation as the answer key. They found contamination in several prominent models on several prominent benchmarks.
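A toy sketch of that intuition — a simplification, not Oren et al.'s actual statistical test, which works with likelihoods of the dataset under shuffled orderings of its examples:

```python
def exchangeability_score(canonical_loglik: float,
                          permuted_logliks: list[float]) -> float:
    """Fraction of shuffled orderings the model likes *less* than the
    canonical benchmark ordering.  If the model never saw the benchmark,
    orderings are exchangeable and this should hover near 0.5; a value
    near 1.0 is a contamination red flag."""
    worse = sum(1 for ll in permuted_logliks if ll < canonical_loglik)
    return worse / len(permuted_logliks)

# Hypothetical log-likelihoods from some model (made-up numbers):
print(exchangeability_score(-120.0, [-135.2, -133.8, -140.1, -131.0]))  # 1.0 -> suspicious
print(exchangeability_score(-120.0, [-118.0, -122.5, -119.2, -121.7]))  # 0.5 -> looks fine
```

The real test turns this comparison into a proper p-value over many permutations, but the logic is the same: memorization shows up as an unearned preference for one specific ordering and phrasing.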

Other teams have studied the magnitude of the effect and found it varies — sometimes scores are inflated by 10+ percentage points on contaminated examples, sometimes less. The honest summary is that contamination is definitely real, definitely inflates scores in some cases, but its exact magnitude is uncertain and varies by model and benchmark. Anyone who tells you it explains everything is overselling. Anyone who tells you it doesn't matter is ignoring evidence.

But here's the issue: almost nobody publishing benchmark scores includes a contamination analysis alongside those scores. As Sainz et al. argued in 2023 [4], contamination measurement should be mandatory for every benchmark claim — like how clinical trials have to disclose conflicts of interest. Almost no one does this. Labs publish scores, journalists report them, investors value companies based on them, and nobody asks "but did you check if the model already knew the answers?"


Enter SWE-bench: Real Bugs, Real Code, Real Problems

Jimenez et al. introduced SWE-bench at ICLR 2024 with a compelling thesis: HumanEval-style problems are toy problems that look nothing like real software engineering [5]. Real developers don't write sorting algorithms from scratch. They navigate massive codebases, read bug reports written by frustrated users, find the relevant files among hundreds, and write patches that fix the problem without breaking everything else.

SWE-bench operationalizes this by pulling from actual GitHub issues and their corresponding pull requests across 12 popular Python repositories — Django, Flask, scikit-learn, sympy, matplotlib, and others. Each task gives the model an issue description (written by a real user reporting a real bug) and the full state of the repository at that point in time. The model has to produce a patch. The patch is tested against a set of tests. Either it passes or it doesn't.
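The grading logic can be sketched like this (a simplification of the harness, with hypothetical test names): SWE-bench distinguishes tests that should flip from fail to pass once the bug is fixed from tests that must keep passing.

```python
def is_resolved(fail_to_pass: dict[str, bool],
                pass_to_pass: dict[str, bool]) -> bool:
    """SWE-bench-style grading (simplified): a patch resolves an instance
    only if every previously-failing test now passes AND no previously-
    passing test regresses.  Keys map test names to post-patch results."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# The bug-reproduction test passes, but the patch broke an unrelated test:
print(is_resolved({"test_issue_12345": True},
                  {"test_existing_api": False}))  # False -> not counted as solved
```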

The initial results were a cold shower. The best model at the time of publication solved about 4.8% of problems. Not 48%. Not even 14.8%. Just 4.8%. This was the same generation of models clearing 85%+ on HumanEval. The gap was staggering, and it said something important: whatever these models had learned (or memorized) from HumanEval had very little to do with the ability to fix a real bug in a real codebase.

SWE-bench also had a contamination story. The tasks were drawn from after January 2023, and the argument was that models trained before that date couldn't have seen them. Plus, even if a model had encountered the same repository in training, the specific combination of issue text + repository state at a specific commit creates a task that doesn't exist elsewhere in exactly that form.

Contamination vs. Relevant Background Knowledge



The defense against the contamination concern goes like this: these are enormously popular open-source projects. Django has been discussed on Stack Overflow, GitHub, mailing lists, and blogs for over a decade. A model doesn't need to have seen the exact SWE-bench instance to benefit. If you've read thousands of Django bug discussions, you develop intuitions about common Django bugs. Is that "contamination"? Is it "relevant background knowledge"? The line is blurry, and SWE-bench doesn't try to draw it.

Then there's the evaluation quality problem. Subsequent work (notably SWE-bench+, published in 2024) found that a non-trivial number of SWE-bench instances had underspecified tests. In some cases, a patch could pass all the tests without actually fixing the bug — it just happened to produce the right output for the wrong reason. In other cases, the issue description was ambiguous enough that multiple valid interpretations existed, but only one was scored as correct. When these noisy instances were cleaned up, model rankings shifted. That means the leaderboard everyone was optimizing against was partly measuring "which model got lucky on poorly specified tasks."

SWE-bench Verified, a later subset, attempted to fix this with human annotation. But who annotated it? How reliable were their judgments? These questions matter a lot and are not fully answered.

And then there's the money problem, the one that's hard to talk about because it implicates everyone: SWE-bench has become a commercial battleground. Cognition used it to market Devin. OpenAI uses it to market their coding agents. Anthropic uses it. A dozen startups put SWE-bench scores on their landing pages. When a benchmark becomes a sales tool, the incentive to optimize specifically for it — through clever scaffolding, prompt engineering, selective evaluation, or just running it hundreds of times and reporting the best result — becomes overwhelming.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. The anthropologist Marilyn Strathern articulated this in 1997 [6], long before anyone was thinking about LLMs. But it describes the SWE-bench situation pretty well. The benchmark was designed to measure capability. It now measures "how much effort did this team put into optimizing for SWE-bench."


LiveCodeBench: The Moving Target

Jain et al. proposed the elegant structural response to contamination in 2024 [7]. The core idea: if the benchmark keeps changing, you can't memorize it.

LiveCodeBench continuously scrapes new problems from competitive programming platforms — LeetCode, AtCoder, and Codeforces — and only evaluates models on problems published after their training cutoff dates. Every problem is timestamped. New problems appear regularly. The benchmark is alive.

This lets you do something interesting: compare how a model performs on problems from before its training cutoff versus after. If there's no contamination, performance should be roughly equal (assuming similar difficulty). If there is contamination, the model should do better on older problems it might have seen.

Jain et al. ran this analysis and found what you'd expect: nearly every model performed better on pre-cutoff problems than post-cutoff problems, even after controlling for difficulty. The gap varied across models but was consistent. This is some of the most direct evidence that contamination inflates scores on static benchmarks.
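The pre/post-cutoff comparison is simple to sketch (made-up data and an assumed cutoff date; not LiveCodeBench's actual code):

```python
from datetime import date

def contamination_gap(results: list[tuple[date, bool]], cutoff: date) -> float:
    """Compare pass rates on problems released before vs. after a model's
    training cutoff.  A large positive gap, with difficulty held roughly
    constant, is consistent with contamination.  Input: (problem release
    date, solved?) pairs -- hypothetical data below."""
    pre  = [ok for d, ok in results if d < cutoff]
    post = [ok for d, ok in results if d >= cutoff]
    rate = lambda xs: sum(xs) / len(xs)
    return rate(pre) - rate(post)

# Hypothetical results around an April 2024 cutoff:
runs = [(date(2024, 1, 5), True),  (date(2024, 2, 9), True),
        (date(2024, 3, 2), True),  (date(2024, 3, 28), False),
        (date(2024, 5, 4), False), (date(2024, 6, 17), True),
        (date(2024, 7, 1), False), (date(2024, 8, 20), False)]
print(contamination_gap(runs, date(2024, 4, 1)))  # 0.5 -> pre-cutoff pass rate 50 points higher
```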

LiveCodeBench also tests more than just generation. It evaluates self-repair (can you fix your own buggy solution given test feedback?), code execution prediction (what will this code output?), and test output prediction. This multi-dimensional evaluation gives you a much richer picture than a single pass/fail number.

The Cutoff Problem



The temporal defense works well if you actually know when the training data was collected. And we usually don't — not with the precision required.

When OpenAI says "GPT-4's training cutoff is April 2024," they're talking about the pretraining corpus. But what about fine-tuning data collected months later? What about RLHF feedback data that includes references to recent problems? What about synthetic training data generated by other models that had seen the problems? What about data purchased from third parties whose scraping practices are opaque?

A model with a "training cutoff of April 2024" might have been fine-tuned in October on data that includes LeetCode problems from July. LiveCodeBench can't detect this because it relies on self-reported cutoff dates.

There's also a scope issue: LiveCodeBench measures competitive programming ability, and competitive programming is a niche skill. These are algorithmic puzzles: dynamic programming, graph traversal, number theory tricks. Many professional software engineers never encounter problems like these in their daily work. A model that aces LiveCodeBench might be terrible at writing a database migration, debugging a race condition, or refactoring a poorly documented legacy module. Coding ability is a much broader concept than competitive programming.


MLE-bench: Kaggle Competitions as a Test


Chan et al., in a collaboration between OpenAI and METR, published MLE-bench in late 2024 [8]. The ambition was bigger than coding puzzles: evaluate whether LLM agents can do end-to-end machine learning work.

The benchmark consists of 75 real Kaggle competitions. Not toy datasets — real competitions that real humans competed in, spanning tabular data, computer vision, NLP, and signal processing. An agent gets the competition description and the data. It has to explore the data, engineer features, choose and train models, tune hyperparameters, and produce a submission file. That submission gets scored against the actual Kaggle leaderboard, and the agent receives a medal (bronze, silver, gold, or none) based on where its score would rank among human competitors.
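The medal logic can be sketched as follows, assuming simplified percentile thresholds (actual Kaggle medal cutoffs vary with the number of competing teams; these numbers are illustrative only):

```python
def medal(agent_rank: int, n_teams: int) -> str:
    """Assign a medal from the agent's position on the human leaderboard.
    Thresholds here are a hypothetical simplification of Kaggle's rules."""
    pct = agent_rank / n_teams
    if pct <= 0.10:
        return "gold"
    if pct <= 0.20:
        return "silver"
    if pct <= 0.40:
        return "bronze"
    return "none"

print(medal(agent_rank=85, n_teams=1000))   # gold: top 8.5% of human competitors
print(medal(agent_rank=450, n_teams=1000))  # none: below the bronze cutoff
```

The key design choice is that the yardstick is human competitors, not a synthetic test suite — the agent is scored against people who were actually trying to win.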

This is a different kind of evaluation. The tasks require long-horizon planning - not writing one function, but orchestrating an entire workflow across dozens of steps. The scoring is calibrated against human performance, not synthetic tests. And the diversity of domains means you can't get by with one trick.

The results at publication: the best agent medaled on roughly 17% of competitions. Performance was highest on tabular data tasks (familiar territory for anyone who's done a Kaggle tutorial) and lowest on domain-specific tasks like medical imaging or audio classification.

An interesting finding: agents hit diminishing returns quickly. Even when given more compute time, performance often plateaued well before the time limit. Human researchers, by contrast, continued finding improvements over the same period. This suggests that current agents lack the ability to keep thinking - to try fundamentally different approaches when their initial strategy doesn't work. They get stuck in local optima.

Conflict of Interest

MLE-bench was published by OpenAI. MLE-bench evaluates agentic capabilities. OpenAI sells agentic products. When the company building the product also designs the test, we need to pause for a second and think.

The contamination defense: Chan et al. argue that even though Kaggle solutions are extensively documented online (winning notebooks, forum discussions, blog write-ups), the specific task of actually executing a solution end-to-end — handling data quirks, runtime errors, library version issues, submission formatting - can't be solved by memorization alone. That's partially true. But it understates how much Kaggle solution knowledge helps.

Here's an example. If you've read 50 notebooks for a specific Kaggle image classification competition, you know that the winning approach used EfficientNet with progressive resizing, the key augmentation was CutMix, the ensemble blended five folds, and the trick that separated gold from silver was test-time augmentation with horizontal flips. You don't need to recall any specific notebook verbatim. You just need to know the approach. And if you know the approach, you can probably medal - especially if you're an LLM with a code execution environment and retries.

This is a softer form of contamination than memorizing HumanEval solutions, but it's still contamination. MLE-bench doesn't measure it and arguably can't.

There's also a subtle methodological issue. The agents aren't just raw models - they're models embedded in scaffolding frameworks (AIDE, OpenHands) that include retry logic, error recovery, iterative refinement, and carefully engineered prompts. When OpenAI reports "o1-preview achieves X% on MLE-bench," what they really mean is "o1-preview plus a specific agentic framework built by specific engineers who iterated on it extensively achieves X%." Disentangling the model's contribution from the scaffolding's contribution is hard. Maybe the model is brilliant. Maybe the scaffolding is doing the heavy lifting. The benchmark doesn't tell you which.




The Other New Benchmarks (And Why They Have Similar Problems)

A few other recent benchmarks are worth mentioning because they're smart in different ways - and flawed in the same structural way.

BigCodeBench [9] noticed that most code benchmarks test self-contained algorithmic puzzles, but real code is mostly glue — calling libraries, piping data between them, handling edge cases in API behavior. The benchmark has 1,140+ tasks that require composing multiple library calls (pandas + matplotlib + numpy in one solution, for instance). Models that scored 90%+ on HumanEval scored below 50% here. That's a huge gap and it probably reflects genuine missing capability, not just contamination. But library APIs change constantly - a model trained on pandas 1.x might fail on a pandas 2.x task not because it lacks coding ability but because it learned a deprecated API. Is that a capability failure or a staleness problem? The benchmark doesn't distinguish.
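The flavor of a "glue" task, as opposed to an algorithmic puzzle, looks roughly like this (a hypothetical task using only the standard library, not an actual BigCodeBench instance): the difficulty is composing several library APIs correctly, not inventing an algorithm.

```python
import csv
import io
import statistics

def summarize_sales(raw_csv: str) -> dict[str, float]:
    """Hypothetical glue task: parse CSV text, group rows by region, and
    return the mean sale amount per region -- three libraries composed
    into one small pipeline."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    by_region: dict[str, list[float]] = {}
    for r in rows:
        by_region.setdefault(r["region"], []).append(float(r["amount"]))
    return {region: statistics.mean(vals) for region, vals in by_region.items()}

data = "region,amount\nwest,100\nwest,300\neast,50\n"
print(summarize_sales(data))  # {'west': 200.0, 'east': 50.0}
```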

SWE-Lancer [10] pulls real freelance tasks from Upwork - actual paid gigs with dollar values attached. A $500 bug fix is presumably harder than a $50 one. The evaluation is whether the solution would meet the original client's acceptance criteria. This is the closest any benchmark has come to asking the economically relevant question: can this model do work someone would pay for? But Upwork pricing is noisy. A $500 task in San Francisco might be a $100 task in Nairobi. The dollar value is a signal, but it's a messy one.

Commit0 (2025) inverts SWE-bench entirely. Instead of patching existing code, models must implement library features from scratch given only API specs and a test suite. You start with an empty file and have to produce working code. This is genuinely harder to game. You can't pattern-match a one-line fix when you need to write hundreds of lines of architecturally coherent code. But it still evaluates via test suites, and test suites are finite. Code that passes all tests might still be poor code, fragile, unmaintainable, or inefficient in ways the tests don't capture.


The Pattern

Let's zoom out and look at who's building what.

OpenAI publishes MLE-bench and SWE-Lancer - benchmarks that test agentic, long-horizon capabilities. OpenAI sells agentic products.

Google publishes benchmarks emphasizing multimodal and long-context capabilities. Google's models are designed around large context windows and multimodal inputs.

Companies that build coding agents publish coding agent benchmarks where their agents score well.

Each benchmark may be individually well-designed. But collectively, they create a landscape where every lab has a test that makes its own products look best. And engineers, scientists, and journalists are left to navigate this without a neutral arbiter.

Imagine if pharmaceutical companies each designed their own clinical trial protocols, ran their own trials, graded their own results, and then published the numbers in their marketing materials. That's similar to what we have in AI evaluation right now.


What Would Fix This

If we're being honest about what credible evaluation would require, the list is short.

Real-world validation. Does a higher SWE-bench score actually predict that a developer will be more productive using the model? Does a higher MLE-bench score predict successful ML deployments? Nobody knows, because nobody has studied this. We're treating benchmark scores as proxies for real-world value without validating the proxy.

Training data transparency. Most contamination debates would evaporate if companies disclosed their training data. They don't, citing competitive concerns. When someone tells you "our model genuinely learned to code, it didn't memorize the test," and you can't see the training data, you're being asked to take it on faith.

Independent benchmark creation. The people designing the test should have no financial relationship with the people being tested. This is basic. Every other field that depends on evaluation - medicine, finance, education - figured this out. AI hasn't yet.

Contamination tests. Every published benchmark score should ideally be accompanied by a contamination analysis. The tools to do this exist, as Oren et al. demonstrated [3]. However, they are not always used in practice, partly because contamination audits can only risk lowering a score, not raising it.


The Bottom Line

The new generation of code benchmarks — SWE-bench, LiveCodeBench, MLE-bench, and others — represents a clear improvement over what came before. These benchmarks tend to be harder, more diverse, and more carefully designed with contamination in mind. Still, relying heavily on HumanEval alone in 2025 risks missing much of the picture.

At the same time, “better than HumanEval” is a relatively modest standard. These newer benchmarks are still largely designed by insiders, overseen by parties with a stake in the results, and typically reported without independent verification. Concerns about contamination are real, but they can also function as a narrative device - helpful for anyone promoting a tougher benchmark who wants to explain why their scores look lower (“finally an honest test!”) rather than consider that the benchmark itself might be noisy or limited.

A cautious way to interpret any leaderboard in 2026 is to start with four questions:
Who designed this test? Who funded it? Did they check for contamination? And does this score actually predict anything I care about in practice?

Until those questions have satisfactory answers, it may be prudent to treat any single number as a claim that is partly scientific and partly marketing. It might be accurate and useful - but you can’t know that from the number alone.



Sources

[1] Chen, M., et al. Evaluating Large Language Models Trained on Code. arXiv, 2021.

[2] Austin, J., et al. Program Synthesis with Large Language Models. arXiv, 2021.

[3] Oren, Y., et al. Proving Test Set Contamination in Black Box Language Models. ICLR, 2024.

[4] Sainz, O., et al. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. Findings of EMNLP, 2023.

[5] Jimenez, C.E., et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024.

[6] Strathern, M. "Improving Ratings: Audit in the British University System." European Review, 5(3), 305–321, 1997.

[7] Jain, N., et al. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv, 2024.

[8] Chan, J., et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv, 2024.

[9] Zhuo, T.Y., et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. arXiv, 2024.

[10] Miserendino, S., et al. "SWE-Lancer: Can Frontier LLMs Earn Money on Real-World Freelance Software Engineering?" OpenAI Research, 2025.