A new ARC benchmark test to measure AGI stumps almost all AI

WHY THIS MATTERS IN BRIEF

In the last AGI test, AI smashed the benchmark and matched human performance, so humans made a new, harder test to put AI back in its place.

 

The Arc Prize Foundation, a nonprofit co-founded by prominent Artificial Intelligence (AI) researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the Artificial General Intelligence (AGI) of leading AI models. The announcement comes on the back of OpenAI’s latest AI smashing all the old tests and “almost” being awarded the honour of being the first AGI.

 

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.
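
To make the puzzle format concrete, here is a minimal sketch of what a task of this kind could look like in code. It assumes grids are stored as lists of integers (one integer per coloured square) and that an answer counts only if it matches the expected grid exactly. The rule `mirror_left_right`, the helper `solves_task` and the example grids are all hypothetical illustrations, not real benchmark tasks.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # assumption: each integer stands for one square's colour


def mirror_left_right(grid: Grid) -> Grid:
    """Hypothetical hidden rule for this toy task: flip every row horizontally."""
    return [list(reversed(row)) for row in grid]


def solves_task(
    candidate: Callable[[Grid], Grid],
    train_pairs: List[Tuple[Grid, Grid]],
    test_input: Grid,
    test_output: Grid,
) -> bool:
    """Return True only if the candidate rule reproduces every example grid
    and the test answer grid exactly (assumed all-or-nothing scoring)."""
    fits_examples = all(candidate(inp) == out for inp, out in train_pairs)
    return fits_examples and candidate(test_input) == test_output


# Made-up demonstration pairs: the solver must infer the rule from these.
train_pairs = [
    ([[1, 0, 0], [2, 2, 0]], [[0, 0, 1], [0, 2, 2]]),
    ([[3, 0], [0, 4]], [[0, 3], [4, 0]]),
]
test_input = [[5, 0, 0, 6]]
test_output = [[6, 0, 0, 5]]

task_solved = solves_task(mirror_left_right, train_pairs, test_input, test_output)
print(task_solved)
```

The point of the sketch is that the rule is never stated: a solver only sees a few input and output grids and has to work out the transformation for itself, which is why memorised answers do not help on an unseen task.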

 

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right – much better than any of the models’ scores.

In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

 

Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” – extensive computing power – to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

 

The version of OpenAI’s o3 model – o3 (low) – that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told reporters that the AI industry lacks sufficient tests to measure the key traits of AGI, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.
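
To put the size of that gap in numbers, here is a rough back-of-the-envelope sketch using only the figures quoted in this article. Treating the contest goal as a single pass/fail check on score and cost per task is an assumption made for illustration, and the function name `meets_prize_target` is hypothetical.

```python
# Figures quoted in the article.
O3_LOW_COST_PER_TASK = 200.00   # dollars of compute per task for o3 (low)
O3_LOW_SCORE = 0.04             # o3 (low) score on ARC-AGI-2
PRIZE_COST_PER_TASK = 0.42      # Arc Prize 2025 cost limit per task, in dollars
PRIZE_SCORE = 0.85              # Arc Prize 2025 accuracy target

# How many times cheaper, and how many times more accurate, a winner would need to be.
cost_ratio = O3_LOW_COST_PER_TASK / PRIZE_COST_PER_TASK
score_ratio = PRIZE_SCORE / O3_LOW_SCORE


def meets_prize_target(score: float, cost_per_task: float) -> bool:
    """Assumed pass/fail rule: hit the accuracy target without exceeding the cost limit."""
    return score >= PRIZE_SCORE and cost_per_task <= PRIZE_COST_PER_TASK


print(f"Cost per task must fall roughly {cost_ratio:.0f}x")
print(f"Score must rise roughly {score_ratio:.2f}x")
print(meets_prize_target(O3_LOW_SCORE, O3_LOW_COST_PER_TASK))
```

On these numbers a contest winner would have to spend several hundred times less per task than o3 (low) while scoring more than twenty times higher.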
