On Saturday, Meta released one of its new flagship AI models, Maverick, which ranks second on LM Arena, a benchmark in which human raters compare model outputs and choose which they prefer. But it appears the version of Maverick tested on LM Arena differs from the version available to developers.
Several AI researchers pointed out on X that Meta's announcement describes the Maverick on LM Arena as an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that the LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."
As previously mentioned, LM Arena has been criticized on various grounds as an unreliable measure of an AI model's performance. Still, AI companies generally haven't customized or fine-tuned their models specifically to score better on LM Arena, or at least haven't openly admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that fact, and then releasing a "vanilla" variant of the same model is that it makes it harder for developers to predict how the model will perform in particular contexts. It's also misleading. Ideally, benchmarks, for all their shortcomings, offer a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in behavior between the publicly downloadable Maverick and the model hosted on LM Arena. The LM Arena version reportedly uses lots of emojis and gives excessively long-winded answers.