A New AGI Benchmark Has Arrived — And It’s Leaving Most AI Models in the Dust

By TechCrunch AI Team | March 24, 2025

The quest for Artificial General Intelligence (AGI) just hit a new milestone — or rather, a new wall. The Arc Prize Foundation, co-founded by renowned AI researcher François Chollet, has released an upgraded version of its intelligence benchmark: ARC-AGI-2. And it’s proving to be a brutal test for today’s smartest models.

So far, even the most advanced AI systems — from OpenAI’s GPT-4.5 to Google DeepMind’s Gemini 2.0 Flash — have scored just around 1% on the ARC-AGI-2 test. That’s not a typo.

What Is ARC-AGI-2?

Unlike typical language benchmarks or image recognition tests, ARC-AGI-2 evaluates raw intelligence and adaptability. The test throws puzzle-like visual problems at AI models, requiring them to detect patterns from multicolored square grids and generate the correct answer grid — without prior exposure to similar problems.

These aren’t just exercises in memory or brute force. They’re designed to mimic how humans approach unfamiliar problems and adapt in real time — a key marker of true general intelligence.
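To make the format concrete, here is a minimal sketch of what an ARC-style task looks like in code. The grids, color values, and the trivial "vertical flip" rule below are illustrative assumptions, not actual ARC-AGI-2 problems (real tasks use novel, harder transformations); the train/test JSON layout mirrors the public ARC task format.

```python
# Hypothetical ARC-style task: grids are lists of lists of color
# indices (0-9). A task provides a few train input/output pairs and a
# test input; the solver must produce the correct test output grid.
task = {
    "train": [
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
    "test": {"input": [[5, 4], [4, 5]]},
}

def solve(task):
    """Naive illustrative solver: check whether every train pair is a
    vertical flip of its input, and if so, apply that rule to the test
    input. Real ARC tasks defeat this kind of single-rule guessing."""
    if all(pair["output"] == pair["input"][::-1] for pair in task["train"]):
        return task["test"]["input"][::-1]
    return None  # transformation not recognized

print(solve(task))  # [[4, 5], [5, 4]]
```

The point of the benchmark is that the transformation rule changes from task to task, so a solver can’t hard-code one trick — it has to infer the rule from a handful of examples each time.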

Humans Are Still Ahead — By a Mile

To establish a baseline, the Arc Prize Foundation had over 400 people complete ARC-AGI-2. The average accuracy? A solid 60%. In contrast, models like OpenAI’s o1-pro and DeepSeek’s R1 landed between 1% and 1.3%, while non-reasoning models (including Claude 3.7 Sonnet) also hovered around 1%.

What Makes ARC-AGI-2 Different?

One of the major criticisms of the original ARC-AGI-1 test was that AI models could “brute force” their way through problems with enough compute power. ARC-AGI-2 fixes this by focusing on efficiency.

For example, OpenAI’s high-performing model o3 (low) scored an impressive 75.7% on ARC-AGI-1 — but only hit 4% on ARC-AGI-2, even when using $200 of computing power per task. Clearly, raw horsepower isn’t enough anymore.

ARC-AGI-2 shifts the question from “Can AI solve this?” to “Can AI solve this like a human — and at a reasonable cost?”

A New Challenge for Developers

With the launch of ARC-AGI-2, the Arc Prize Foundation also announced the Arc Prize 2025 Challenge. The goal? Build a model that scores at least 85% on ARC-AGI-2 — with just $0.42 in compute per task.
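The figures quoted above make the efficiency framing easy to quantify. The sketch below uses only the numbers reported in this article; the "accuracy per dollar" metric is an illustrative assumption, not an official Arc Prize scoring rule.

```python
# Score and per-task compute cost, as quoted in this article.
results = {
    "o3 (low) on ARC-AGI-1": {"score": 0.757, "cost_per_task": 200.00},
    "o3 (low) on ARC-AGI-2": {"score": 0.04, "cost_per_task": 200.00},
    "Arc Prize 2025 target": {"score": 0.85, "cost_per_task": 0.42},
}

for name, r in results.items():
    # Hypothetical efficiency metric: accuracy achieved per dollar spent.
    efficiency = r["score"] / r["cost_per_task"]
    print(f"{name}: {r['score']:.1%} at ${r['cost_per_task']:.2f}/task "
          f"-> {efficiency:.4f} accuracy per dollar")
```

By this rough measure, the prize target demands roughly four orders of magnitude more efficiency than o3 (low) delivered on ARC-AGI-2 — which is exactly the gap between brute-force compute and human-like adaptability the benchmark is designed to expose.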

This contest is set to attract top talent in the field, especially at a time when experts like Hugging Face’s Thomas Wolf are calling for more robust AGI benchmarks. As AI rapidly progresses, traditional benchmarks are proving too easy, too fast.

Why It Matters

As we inch closer to AGI, we need better ways to measure progress. ARC-AGI-2 is a timely reminder that being smart isn’t just about having a large training set or massive GPUs. True intelligence is about adaptability, reasoning, and efficiency — the same traits that define human cognition.

With most models still failing at this test, the race for general intelligence is far from over.


