Did xAI Mislead About Grok 3’s Benchmarks? OpenAI Disputes Claims

Debates over AI benchmarks have resurfaced following xAI’s recent claims about its latest model, Grok 3. An OpenAI employee publicly accused Elon Musk’s xAI of presenting misleading benchmark results, while xAI co-founder Igor Babuschkin defended the company’s methodology. The controversy stems from a graph published by xAI showing Grok 3’s performance on AIME 2025, a benchmark built from difficult problems in a recent invitational mathematics exam. While some AI researchers question AIME’s validity as an AI benchmark, it remains a commonly used test of AI models’ math capabilities.

The Missing Benchmark Data

In xAI’s chart, Grok 3 Reasoning Beta and Grok 3 mini Reasoning were shown to outperform OpenAI’s o3-mini-high model on AIME 2025. However, OpenAI employees quickly pointed out that xAI did not include o3-mini-high’s score at “cons@64.” The “cons@64” (consensus@64) metric allows a model to attempt each problem 64 times and takes the most frequent response as the final answer. Since this tends to raise a model’s benchmark scores considerably, omitting it from xAI’s comparison may have made Grok 3 appear more advanced than it actually is.

When comparing @1 scores (which measure a model’s first-attempt accuracy), Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored below OpenAI’s o3-mini-high. Additionally, Grok 3 Reasoning Beta trailed OpenAI’s o1 model set to “medium” computing, raising further questions about xAI’s claim that Grok 3 is the “world’s smartest AI.”
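
The gap between the two metrics can be illustrated with a minimal Python sketch. It is not xAI’s or OpenAI’s evaluation code: the function names, the toy answers, and the tie-breaking rule (the earliest-seen answer wins a tie) are all assumptions made for illustration, and k is set to 4 rather than 64 to keep the data readable.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the sampled attempts.

    Assumption: ties go to the answer that appeared first.
    """
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems whose first attempt is correct (an "@1" score)."""
    correct = sum(
        attempts[0] == truth
        for attempts, truth in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


def score_cons_at_k(attempts_per_problem, answer_key, k):
    """Fraction of problems whose majority answer over k attempts is correct."""
    correct = sum(
        consensus_answer(attempts[:k]) == truth
        for attempts, truth in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: three problems, four sampled answers each.
attempts = [
    ["113", "204", "204", "204"],  # first try wrong, majority right
    ["081", "070", "070", "070"],  # first try wrong, majority right
    ["588", "016", "016", "279"],  # first try right, majority wrong
]
answer_key = ["204", "070", "588"]

at_1 = score_at_1(attempts, answer_key)
cons_at_4 = score_cons_at_k(attempts, answer_key, k=4)
print(f"@1: {at_1:.2f}, cons@4: {cons_at_4:.2f}")
```

On this toy data the same model gets 1 of 3 problems right on its first attempt and 2 of 3 under consensus voting, which is why putting one model’s consensus score next to another model’s @1 score is not a like-for-like comparison.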

xAI Defends Its Approach, OpenAI Calls for Transparency

Igor Babuschkin, co-founder of xAI, responded on X, arguing that OpenAI has also presented selective benchmarks, though mainly when comparing its own models. A third-party AI researcher attempted to provide a more balanced view by compiling a graph of various models’ performance at cons@64, aiming to offer a more transparent comparison. However, AI researcher Nathan Lambert pointed out a key missing element in the debate: computational cost. Without knowing how much computational power (and money) each model needed to achieve its best scores, benchmark results alone do not fully convey an AI model’s efficiency or real-world capabilities.

What’s Next for AI Benchmarks?

The dispute between xAI and OpenAI highlights ongoing challenges in AI benchmarking. As AI labs race to demonstrate superiority, the lack of standardized, transparent, and cost-aware metrics continues to fuel debates over how AI models should be evaluated. While xAI stands by its claims, OpenAI’s criticism raises questions about how AI companies should present performance results to avoid misleading comparisons. The broader AI community may need to push for more standardized evaluation methods to ensure fairness and accuracy in future AI model comparisons.

Elon Musk’s AI Revolution Continues as xAI Unveils Grok 3 AI Model

Artificial intelligence is evolving rapidly, and xAI has taken up the challenge with the launch of the Grok 3 AI model, which promises advanced reasoning capabilities, deep search, and a voice mode, and which the company pitches as the most intelligent chatbot around. In today’s fiercely competitive AI race, though, one question remains: will Grok 3 outwit, outlast, and out-GPT its competitors?

Grok 3, from Elon Musk’s xAI, is the company’s latest flagship AI model, with new features rolling out in the Grok apps on iOS and the web. It is designed to compete with OpenAI’s GPT-4o and Google’s Gemini. Grok 3 offers image analysis, reasoning, and deep research functions, and xAI is positioning it as a new benchmark for AI performance.

Grok 3 AI Model, a Step Ahead in AI:

The Grok 3 AI model had reportedly been under development for a few months and required massive training runs in a Memphis data center with nearly 200,000 GPUs. In a post on X, Musk claimed that Grok 3 was developed with 10x more computing power than its predecessor, Grok 2, and with an expanded training data set that ostensibly includes filings from court cases.

During a live-streamed presentation, Musk said, “Grok 3 is an order of magnitude more capable than Grok 2. [It’s a] maximally truth-seeking AI, even if that truth is sometimes at odds with what is politically correct.” Grok 3 is not one model but a family of models, including Grok 3 mini, which responds faster but with less accuracy. Not all features are available right away, but the rollout began on Monday.

Performance and Benchmarks:

According to xAI, Grok 3 surpasses GPT-4o on major AI benchmarks, including the American Invitational Mathematics Examination (AIME), which tests mathematical problem solving, and the Graduate-Level Google-Proof Q&A Benchmark (GPQA). Early on, Grok 3 also scored competitively in Chatbot Arena, a crowdsourced platform for assessing AI models.

Grok 3 also has two special variants, Grok 3 Reasoning and Grok 3 mini Reasoning, which are positioned against OpenAI’s o3-mini and DeepSeek’s R1 on certain problem-solving tasks. These reasoning models work through a self-verification step before delivering their answers, which is meant to improve correctness on mathematical, scientific, and programming questions. xAI claims Grok 3 Reasoning is better than OpenAI’s leading o3-mini-high model at mathematics, specifically on AIME 2025. Users can interact with these models from the Grok app, using Think mode for general reasoning or Big Brain mode for complex computations requiring additional processing.

DeepSearch and Subscription Plans:

Grok 3’s reasoning also underpins DeepSearch, an AI-powered research tool from xAI that scans the internet and X to deliver complete analyses. DeepSearch is xAI’s answer to similar tools from OpenAI and other AI firms. Access to Grok 3 is initially limited to subscribers of X’s Premium+ plan, which costs $22 per month. Meanwhile, an upgraded subscription tier, SuperGrok, is rumored to cost $30 per month or $300 per year, offering additional reasoning queries, DeepSearch capabilities, and unlimited image generation. Musk also announced that Grok 3 will soon gain a voice mode, enhancing user interaction with synthesized speech.

Within weeks, Grok 3 models will be available in xAI’s enterprise API alongside DeepSearch functionality. Additionally, Musk revealed plans to open-source Grok 2 in the coming months, stating, “Our general approach is that we will open-source the last version [of Grok] when the next version is fully out. When Grok 3 is mature and stable, which is probably within a few months, then we’ll open-source Grok 2.”

Political Controversy:

Musk originally pitched Grok as an alternative to mainstream AIs, promising it would be “edgy, unfiltered, and anti-woke.” In previous versions, however, Grok tended to tread carefully around political subjects and to align with the left-leaning views that prevail on most social issues. Musk attributed that to Grok’s training data and promised to shift its tone toward political neutrality.

Whether Grok 3 will achieve that remains to be seen, as does how its “truth-seeking” stance will shape public discourse and AI governance. The release is a milestone in a changing AI landscape, and xAI promises it will stretch what AI models are capable of. How far Grok 3 lives up to its claimed capabilities is still an open question as Musk continues to push for truth-seeking AI.
