A dispute over questionable benchmark reporting has surfaced in the AI industry, pitting xAI’s Grok 3 against OpenAI’s models.
An OpenAI employee accused xAI of publishing misleading benchmark data for Grok 3 that appeared to show it outperforming OpenAI’s models on the AIME 2025 math exam. xAI, however, contends that its reporting is accurate.
A key point of contention is the omission of “cons@64” (consensus@64) data from xAI’s graph. Under this metric, a model samples 64 answers per problem and its majority answer is scored, which typically raises accuracy well above single-attempt (“@1”) results. When cons@64 scores are included, OpenAI’s models surpass Grok 3’s single-attempt performance on AIME 2025.
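To make the distinction concrete, here is a minimal sketch of how consensus-style scoring works. All function names and data below are hypothetical illustrations, not xAI’s or OpenAI’s actual evaluation code; the idea is simply that the majority answer across k samples is graded instead of a single attempt.

```python
from collections import Counter

def consensus_answer(samples):
    """Majority vote: the model's final answer is the most common
    answer among its k sampled attempts for one problem."""
    answer, _count = Counter(samples).most_common(1)[0]
    return answer

def consensus_score(attempts, reference):
    """Fraction of problems where the consensus answer matches the
    reference answer. `attempts` maps problem id -> list of k sampled
    answers; `reference` maps problem id -> correct answer."""
    hits = sum(consensus_answer(ans) == reference[pid]
               for pid, ans in attempts.items())
    return hits / len(attempts)

# Made-up data: 3 problems, k = 4 samples each.
attempts = {
    "p1": ["42", "42", "17", "42"],  # consensus "42"
    "p2": ["7", "9", "9", "9"],      # consensus "9"
    "p3": ["3", "5", "3", "8"],      # consensus "3"
}
reference = {"p1": "42", "p2": "9", "p3": "5"}

print(round(consensus_score(attempts, reference), 3))  # -> 0.667
```

Because a stray wrong sample is outvoted by correct ones, a model’s cons@64 score can sit well above its @1 score, which is why comparing one model’s cons@64 against another’s @1 on the same chart is misleading.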
xAI counters that OpenAI has engaged in similar selective benchmark reporting in the past. Independent analysis suggests that cons@64 results give a fuller view of model performance.
AI researcher Nathan Lambert notes, however, that the computational and financial cost of achieving these scores is rarely disclosed, underscoring how little current benchmarks convey about models’ real capabilities and constraints.
Original source: Read the full article on TechCrunch