Unveiling the Truth: xAI’s Controversial Grok 3 Benchmark Claims Examined

In the realm of artificial intelligence (AI), the reliability and transparency of performance benchmarks have come under fire. This week, accusations emerged that xAI, a company founded by Elon Musk, may have misrepresented the results of its Grok 3 AI model.

According to an OpenAI employee, xAI’s benchmark results presented Grok 3 as superior to OpenAI’s best available model, o3-mini-high, on the AIME 2025 mathematics exam. However, scrutiny revealed that xAI’s chart omitted o3-mini-high’s score at “cons@64” (consensus@64), a widely used setting that gives a model 64 attempts at each problem and takes its most frequent answer as the final one, a method that tends to boost scores considerably.

When compared at “@1,” the score counting only a model’s first attempt, Grok 3 fell short of both o3-mini-high and OpenAI’s o1 model. Despite these findings, xAI has continued to promote Grok 3 as the “world’s smartest AI.”
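To make the distinction concrete, here is a minimal sketch of how consensus-at-k scoring differs from first-attempt scoring. This is not xAI’s or OpenAI’s actual evaluation code, and the answer strings below are hypothetical; it simply illustrates why a majority vote over 64 samples can score higher than a single attempt.

```python
from collections import Counter

def consensus_at_k(attempts: list[str], correct: str) -> bool:
    """cons@k scoring: the final answer is the most frequent
    answer across all k sampled attempts (majority vote)."""
    most_common_answer, _ = Counter(attempts).most_common(1)[0]
    return most_common_answer == correct

def first_attempt(attempts: list[str], correct: str) -> bool:
    """'@1' scoring: only the model's first attempt counts."""
    return attempts[0] == correct

# Hypothetical: 64 sampled answers to one exam problem.
# The first sample is wrong, but the majority answer is right.
attempts = ["112"] + ["113"] * 40 + ["112"] * 23

print(consensus_at_k(attempts, "113"))  # True  -- majority vote recovers "113"
print(first_attempt(attempts, "113"))   # False -- the single attempt was "112"
```

As the example shows, the same model on the same problem can fail at @1 yet pass at cons@64, which is why omitting one of the two numbers from a comparison chart can be misleading.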

Igor Babushkin, co-founder of xAI, defended the company’s actions, claiming that OpenAI had also released potentially misleading benchmarks. However, a third party compiled a more comprehensive graph that included the cons@64 metric, providing a more accurate picture of the models’ performance.

One crucial figure remains undisclosed on both sides: the computational and financial cost each model incurred to achieve its best benchmark score. AI researcher Nathan Lambert emphasizes that such information is essential for understanding the real limitations and strengths of different models.

The ongoing debate highlights the need for transparency and caution in reporting AI benchmark results. As the field of AI continues to evolve, it is imperative to establish reliable and meaningful metrics that accurately convey models’ capabilities and limitations.

Original source: Read the full article on TechCrunch