Super Mario Emerges as an Unconventional Benchmark for AI Performance

While many consider Pokémon a challenging benchmark for AI, a group of researchers contends that Super Mario Bros. may actually be even more difficult. The Hao AI Lab, based at the University of California San Diego, recently tested AI systems in live gameplay of Super Mario Bros. Their findings showed that Anthropic’s Claude 3.7 outperformed others, followed closely by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4 struggled.

The version of Super Mario Bros. used in the experiment wasn’t the original 1985 release. The game ran in an emulator integrated with a framework called GamingAgent, which allowed the AIs to control Mario.

Developed in-house by the Hao AI Lab, GamingAgent gave each AI basic instructions, such as “move/jump left to dodge” when an obstacle or enemy is near, along with in-game screenshots. The AI then produced control inputs in the form of Python code to maneuver Mario.

The Hao team asserts that this setup required each AI model to “learn” how to plan complex maneuvers and devise gameplay strategies. Interestingly, they found that reasoning models such as OpenAI’s o1, which work through problems step by step, performed worse than non-reasoning models, even though reasoning models are generally stronger on most other benchmarks.

The researchers noted that one significant hurdle for reasoning models in real-time games like Super Mario Bros. is decision-making speed: these models often take seconds to settle on an action. In a game where timing is critical, a split second can mean the difference between a successful jump and a fatal fall.

Games have served as a benchmark for AI capabilities for decades. However, some experts question whether AI performance in games says much about broader technological progress. Unlike the real world, games tend to be abstract and relatively simple, and they provide a virtually limitless supply of data for AI training.
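The timing problem facing slower models can be made concrete with a little arithmetic. The sketch below is purely illustrative: it assumes the game renders roughly 60 frames per second (the approximate rate of the NTSC NES), and the latencies are hypothetical values, not measurements from the Hao AI Lab experiment. The function name `frames_elapsed` is likewise invented for this example.

```python
# Hypothetical illustration: how many game frames pass while a model "thinks".
# Assumes about 60 frames per second; the latencies below are made-up
# examples, not figures reported by the researchers.
FRAMES_PER_SECOND = 60


def frames_elapsed(latency_seconds: float, fps: int = FRAMES_PER_SECOND) -> int:
    """Return the number of whole frames rendered during one model decision."""
    return int(latency_seconds * fps)


for label, latency in [("fast reply", 0.5), ("slow reasoning reply", 3.0)]:
    print(f"{label}: {latency:.1f} s -> about {frames_elapsed(latency)} frames")
```

Under these assumptions, a model that takes three seconds to respond is acting on a screenshot that is about 180 frames out of date, which is plenty of time for Mario to have walked into an enemy or off a ledge.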

The recent wave of attention-grabbing gaming benchmarks has led Andrej Karpathy, a research scientist and founding member of OpenAI, to describe what he calls an “evaluation crisis.” In a post on X, he wrote, “I don’t really know what [AI] metrics to look at right now. TLDR: my reaction is I don’t really know how good these models are right now.”

At least we can enjoy watching AI play Mario.