Pokémon and the Problem with AI Benchmarks
Even Pokémon isn’t immune to the growing debate over how AI models are evaluated.
A viral post on X last week claimed that Google’s Gemini model had outperformed Anthropic’s Claude in playing through the original Pokémon trilogy. According to the post, Gemini had advanced to Lavender Town, while Claude remained stuck at Mount Moon as of late February.
However, there’s a catch the post didn’t mention.
Reddit users were quick to point out that the developer behind the Gemini livestream had built a custom minimap overlay, giving the model a significant edge. The overlay labels in-game elements, such as cuttable trees, so Gemini doesn't have to infer what it's looking at from raw screenshots. That shortcut can greatly simplify decision-making during gameplay.
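To make that concrete, here is a minimal Python sketch of the two setups, assuming a hypothetical agent harness: in one, the model receives only a raw screenshot; in the other, it receives a pre-labeled minimap. The `MinimapTile` type and the prompt-building functions are illustrative placeholders, not the actual livestream tooling.

```python
# Hypothetical sketch of how a harness might present game state to a model.
# None of these names come from the real livestream setup.

from dataclasses import dataclass


@dataclass
class MinimapTile:
    x: int
    y: int
    kind: str  # e.g. "wall", "grass", "cuttable_tree", "npc"


def prompt_from_screenshot(png_bytes: bytes) -> list:
    """Raw-pixels setup: the model must work out what everything on screen is."""
    return [
        {"type": "image", "data": png_bytes},
        {"type": "text", "text": "Here is the current frame. Decide the next button press."},
    ]


def prompt_from_minimap(tiles: list[MinimapTile], player_xy: tuple[int, int]) -> list:
    """Overlay setup: tile semantics are already labeled, so no visual inference is needed."""
    legend = "\n".join(f"({t.x},{t.y}): {t.kind}" for t in tiles)
    return [
        {
            "type": "text",
            "text": f"Player at {player_xy}. Nearby tiles:\n{legend}\nDecide the next button press.",
        },
    ]
```

The second prompt hands the model exactly the kind of information, like which tree is cuttable, that the first setup would force it to extract from pixels on its own.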
Pokémon may not be a rigorous AI benchmark, but the controversy illustrates a broader issue: how implementation details can dramatically impact benchmark outcomes.
The Pokémon case is just one example in a wider trend. Consider Anthropic's own reporting on Claude 3.7 Sonnet's results on SWE-bench Verified, a benchmark that tests coding ability. With no additional help, Claude scored 62.3% accuracy. But with a custom scaffold, a harness the developers built around the model to boost its performance, it reached 70.3%.
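For a sense of what a scaffold can add on top of a bare model call, here is a rough Python sketch of the general pattern: a harness that applies the model's patch, runs the repository's tests, and feeds failures back for another attempt. The `model.generate` and `run_tests` helpers are placeholders; this illustrates the idea, not Anthropic's actual setup.

```python
# Rough illustration of what an agentic "scaffold" adds over a single model call.
# The model object and run_tests function are placeholders, not a real API.

def bare_attempt(model, issue: str) -> str:
    """Single-shot: one prompt in, one patch out, no feedback loop."""
    return model.generate(f"Write a patch that fixes this issue:\n{issue}")


def scaffolded_attempt(model, issue: str, run_tests, max_iters: int = 5) -> str:
    """Scaffold: the harness runs the test suite and feeds failures back to the model."""
    patch = bare_attempt(model, issue)
    for _ in range(max_iters):
        ok, log = run_tests(patch)  # apply patch, run repo tests, capture output
        if ok:
            break
        patch = model.generate(
            f"Issue:\n{issue}\n\nYour previous patch failed these tests:\n{log}\n"
            "Return a corrected patch."
        )
    return patch
```

The model is identical in both functions; only the surrounding tooling changes, which is exactly why scaffolding can move a benchmark score without the underlying model improving at all.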
Similarly, Meta recently fine-tuned a version of Llama 4 Maverick to excel at LM Arena, a popular benchmark. The customized version scored significantly better than the base model, highlighting how targeted optimization can skew perceived performance.
These examples underscore a persistent problem in AI: benchmark results are only as comparable as the conditions under which they're produced. Custom scaffolds, helper tools, and tailored environments can lift scores, but they also make model-to-model comparisons less reliable, especially when those optimizations aren't clearly disclosed.
In a landscape where benchmarks are already imperfect proxies for real-world intelligence, such inconsistencies threaten to further muddy the waters. If the goal is to understand how models stack up in meaningful, generalizable ways, transparency around evaluation methods will be more important than ever.