OpenAI’s o3 AI Model Scores Lower Than Claimed, Raising Transparency Concerns

OpenAI’s o3 AI model scored around 10% on the FrontierMath benchmark in independent tests, far below the company’s initial 25% claim.
The higher score was achieved using a larger, more powerful internal version of o3, not the public release, which is optimised for speed and usability.
The incident underscores growing concerns about transparency and reliability in AI benchmark reporting across the industry.

OpenAI is under continued scrutiny after independent tests revealed that its o3 AI model performs significantly worse on a key math benchmark than the company initially suggested.

When OpenAI unveiled o3 in December 2024, it claimed the model could solve just over 25% of problems on the challenging FrontierMath test, far surpassing rival models, which managed only about 2%.

However, new results from Epoch AI, the research group behind FrontierMath, show the public version of o3 scores closer to 10%, well below OpenAI’s headline figure.

The discrepancy appears to stem from differences in testing conditions. OpenAI’s higher score was achieved with a more powerful, internal version of o3 using greater computational resources, whereas the public release is optimised for speed and real-world use, sacrificing some raw performance.

The ARC Prize Foundation, which tested a pre-release version of o3, confirmed that the public o3 is a smaller, chat-focused model, not the one that achieved the highest benchmark scores.

OpenAI staff acknowledged these differences, emphasising that optimisations were made for efficiency and user experience.

Newer models such as o3-mini-high and o4-mini now outperform o3 on FrontierMath, but the episode highlights the need for scepticism about AI benchmark claims, especially as similar controversies have recently affected other major AI companies.

Edited by Annette George