Teen Developer Uses Minecraft to Create More Intuitive AI Benchmarks

A high school senior created a Minecraft-based platform for comparing AI models' building capabilities.
Major AI labs including Anthropic, Google, OpenAI, and Alibaba subsidize the project's use of their models, though none is officially affiliated with it.
The visual nature of Minecraft makes AI progress more accessible to the general public than traditional benchmarks.

As AI developers seek more intuitive ways to evaluate model capabilities, a high school senior has created a novel solution using the world's best-selling video game.

Minecraft Benchmark (MC-Bench) pits AI systems against each other in creating Minecraft builds based on user prompts, with humans voting on which construction is superior.

Developed by 12th-grader Adi Singh with a team of eight volunteer contributors, MC-Bench draws on how widely familiar Minecraft is to make AI evaluation accessible to the general public.

Users see two AI-generated builds made from the same prompt and vote for the better one; which model produced each build is revealed only after the vote.
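
For readers curious how blind head-to-head votes like these can become a ranking, here is a minimal Python sketch using Elo-style ratings. The article does not say how MC-Bench computes its leaderboard, so the rating method, the constants, and the model names below are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical constants; MC-Bench's real scoring method is not described in the article.
K_FACTOR = 32.0         # how far a single vote can move a rating
START_RATING = 1000.0   # rating assigned to a model before its first vote


def expected_score(rating_a, rating_b):
    """Elo estimate of the probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def apply_votes(votes, k=K_FACTOR, start=START_RATING):
    """Fold a sequence of (winner, loser) votes into per-model ratings."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        win_probability = expected_score(ratings[winner], ratings[loser])
        # An upset (low win probability) moves ratings more than an expected win.
        delta = k * (1.0 - win_probability)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def leaderboard(ratings):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Placeholder model names; each tuple is one human vote: (preferred build, other build).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = apply_votes(votes)
for name, rating in leaderboard(ratings):
    print(f"{name}: {rating:.1f}")
```

Because voters do not know which model made which build until after they vote, each tuple reflects a judgment of the build itself rather than of a model's reputation.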

"Minecraft allows people to see the progress [of AI development] much more easily," Singh remarked. "People are used to Minecraft, used to the look and the vibe."

The project has drawn support from major AI labs: Anthropic, Google, OpenAI, and Alibaba subsidize its use of their products to run benchmark prompts, though none of the companies is officially affiliated with it.

While MC-Bench currently focuses on simple constructions like "Frosty the Snowman" or "a tropical beach hut," Singh envisions expanding to more complex, goal-oriented tasks.

"Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes," he explained.

The project addresses a fundamental challenge in AI evaluation. Traditional benchmarks often reward the narrow skills models are already strong at, such as memorization or basic extrapolation, so a high score says little about real-world usefulness.

By contrast, MC-Bench produces visual results that people can judge intuitively.

Singh believes the results are meaningful: "The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks."

Edited By Annette George