How Intelligent Is AI, Really?
By Y Combinator
Key Concepts
- ARC (Abstraction and Reasoning Corpus): A benchmark designed to measure an AI’s ability to learn new concepts and generalize, mirroring human intelligence.
- AGI (Artificial General Intelligence): AI with the ability to understand, learn, adapt, and implement knowledge across a wide range of tasks, much like a human.
- Vanity Metrics: Metrics that look good on the surface but don’t necessarily indicate meaningful progress towards a core goal.
- Generalization: The ability of a system to apply knowledge learned in one context to new, unseen situations.
- RL (Reinforcement Learning): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward.
- Human-Level Performance: Evaluating AI not just on accuracy, but also on the time, data, and energy required to achieve results, compared to human capabilities.
The ARC Prize Foundation and the Measurement of Intelligence
The ARC Prize Foundation, a nonprofit, is focused on accelerating progress toward Artificial General Intelligence (AGI). Its core approach centers on a specific definition of intelligence from François Chollet’s 2019 paper, “On the Measure of Intelligence”: intelligence isn’t the ability to solve increasingly difficult problems within a single domain (like harder SAT questions), but the ability to learn new things efficiently. Greg Kamradt, president of the ARC Prize Foundation, emphasizes this distinction, defining “intelligence as your ability to learn new things.”
The ARC Benchmark: A Shift in Evaluation
Traditional AI benchmarks, like MMLU and its variants, focus on pushing the boundaries of performance within specific, well-defined tasks. The ARC benchmark, however, is designed to assess a system’s capacity for generalization. Unlike benchmarks that require “PhD++” level problem-solving, ARC tasks are designed to be solvable by the average person. This ensures the benchmark tests genuine learning ability, not just rote memorization or specialized expertise.
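To make the format concrete, here is a minimal sketch of an ARC-style task. Public ARC tasks are JSON objects with “train” demonstration pairs and held-out “test” inputs, where each grid is a 2D array of integers 0-9 representing colors; the toy rule below (mirror each row) is invented for illustration and is not an actual dataset task.

```python
# Toy ARC-style task: a few demonstration pairs plus a held-out test input.
# Grids are 2D lists of integers 0-9 (color codes). The hidden rule here,
# invented for this sketch, is "mirror each grid left-to-right."
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [0, 5]]},  # expected output: [[4, 0], [5, 0]]
    ],
}

def candidate_rule(grid):
    """One hypothesis a solver might form: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# A solver validates its hypothesis against every demonstration pair
# before committing to an answer on the test input.
assert all(candidate_rule(p["input"]) == p["output"] for p in toy_task["train"])
print(candidate_rule(toy_task["test"][0]["input"]))  # [[4, 0], [5, 0]]
```

The point of the format is that the rule must be inferred from just a few examples at test time; it cannot be memorized in advance.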
Initially, in 2024, base models like GPT-4 performed poorly on the ARC benchmark, achieving only 4-5% accuracy. The introduction of reasoning capabilities, exemplified by the release of OpenAI’s o1, brought a significant jump to 21%. This rapid improvement highlighted reasoning as a key driver of AI progress. Major AI labs, including OpenAI, xAI, Google DeepMind, and Anthropic, now use the ARC benchmark to report their model performance.
Avoiding Vanity Metrics and Identifying False Positives
While the adoption of ARC by leading AI labs is positive, the ARC Prize Foundation cautions against treating benchmark scores as the sole indicator of progress. Kamradt warns of “vanity metrics,” emphasizing that the foundation’s ultimate goal is to inspire open AGI progress, not simply to see high scores on a single test.
A common “false positive” is over-reliance on Reinforcement Learning (RL) environments. Kamradt argues that creating a specialized RL environment for every potential task is unsustainable and sidesteps the core challenge of AGI: handling novel problems. As he puts it, “You’re not going to be able to make RL environments for every single thing you’re going to end up wanting to do.” He advocates instead for investment in systems that can generalize without task-specific environments, mirroring how humans learn.
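To illustrate what a “task-specific RL environment” entails, here is a minimal sketch in the spirit of the standard reset/step contract (the class and names are hypothetical, not from any real library). The reward function hard-codes what success means for this one task, which is exactly the part that must be re-authored for every new task:

```python
class ReachGoalEnv:
    """Toy task-specific environment: start at position 0, reach position 3."""

    GOAL = 3  # success criterion baked into this one environment

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        # action is +1 (move right) or -1 (move left)
        self.pos += action
        done = self.pos == self.GOAL
        reward = 1.0 if done else 0.0  # hand-designed reward for this task only
        return self.pos, reward, done

# An agent trained against this environment learns *this* task; a new task
# (a different goal, a different state space) needs a new hand-built class.
env = ReachGoalEnv()
obs, done, steps = env.reset(), False, 0
while not done:
    obs, reward, done = env.step(+1)  # trivial hard-coded policy for the demo
    steps += 1
print("solved in", steps, "steps")  # 3
```

Kamradt’s argument is that this authoring cost scales linearly with the number of tasks, while a generalizing system would need no bespoke environment at all.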
The Evolution of ARC: From Static to Interactive
The ARC benchmark has evolved through three versions:
- ARC AGI 1 (2019): The original benchmark, consisting of 800 tasks created by François Chollet. It established the foundation for measuring generalization ability.
- ARC AGI 2 (2025): A harder, upgraded version of ARC AGI 1 that retains the static benchmark format.
- ARC AGI 3 (Coming 2026): A significant departure from previous versions, ARC AGI 3 will be interactive. It will feature approximately 150 video game-like environments where the AI must take actions and observe the consequences. Crucially, these environments will provide no instructions – no text, symbols, or guidance – forcing the AI to discover the goal through experimentation.
This shift to interactivity is motivated by the belief that true AGI will require interaction with a dynamic environment, mirroring the real world. The ARC Prize Foundation will also test humans on every game in ARC AGI 3; any game humans cannot solve will be excluded, keeping the benchmark anchored to human-level intelligence.
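As a rough sketch of the interaction loop this implies (the environment and agent below are hypothetical stand-ins, since ARC AGI 3’s actual interface has not been released), the agent receives unlabeled observations, acts, and must infer the goal purely from how the environment responds:

```python
import random

class MysteryEnv:
    """Hidden rule, unknown to the agent: taking action 2 three times wins.
    Observations carry no text, symbols, or instructions."""

    def __init__(self):
        self.progress = 0

    def step(self, action):
        if action == 2:
            self.progress += 1
        done = self.progress >= 3
        return {"screen": self.progress}, done  # unlabeled observation

env = MysteryEnv()
done, actions_taken = False, 0
while not done:
    # With no instructions, a naive agent can only experiment and watch
    # which actions change the observation.
    obs, done = env.step(random.choice([0, 1, 2, 3]))
    actions_taken += 1
print("goal discovered after", actions_taken, "actions")
```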
Measuring Efficiency: Beyond Accuracy and Time
The ARC Prize Foundation is moving beyond accuracy alone. It is evaluating AI on the efficiency of the learning process itself, specifically the amount of training data and energy required, and is developing benchmarks that measure these factors against human capabilities.
ARC AGI 3, with its turn-based video game format, will measure efficiency by counting the number of actions required for an AI to complete a game, comparing it to the average number of actions taken by humans. This approach aims to avoid the “brute force” solutions seen in earlier AI video game attempts (like those in 2016), which relied on massive amounts of data and computation.
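One way such an action-count comparison could be scored (this formula is an assumption for illustration; the summary does not specify ARC AGI 3’s official metric) is as a simple ratio against the human baseline:

```python
def action_efficiency(agent_actions: int, human_avg_actions: float) -> float:
    """Hypothetical score in (0, 1]: 1.0 means the agent matched or beat the
    average human; lower values mean proportionally more actions were needed."""
    return min(1.0, human_avg_actions / agent_actions)

# A brute-force agent that flails for 1,200 turns scores far below a human
# who averages 40 turns; a focused agent at 35 turns caps out at 1.0.
print(action_efficiency(agent_actions=1200, human_avg_actions=40))  # ~0.033
print(action_efficiency(agent_actions=35, human_avg_actions=40))    # 1.0
```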
What if an AI Solves ARC?
Kamradt acknowledges the hypothetical scenario of an AI achieving 100% accuracy on the ARC benchmarks. He clarifies that solving ARC is a necessary but not sufficient condition for AGI; it would, however, represent the most authoritative evidence to date of a system capable of generalization. The ARC Prize Foundation would then analyze the system to identify remaining failure points and continue refining the benchmark to guide future research. He concludes, “At ARC Prize, we want to put ourselves in a position where we can fully understand and be ready to declare when we do actually have AGI.”
Conclusion
The ARC Prize Foundation is pioneering a new approach to evaluating AI, shifting the focus from specialized performance to genuine generalization. The ARC benchmark and its evolving iterations are a critical tool for measuring progress toward AGI, emphasizing efficiency, human-level comparison, and the ability to learn new things, the essence of intelligence as Chollet defines it. The foundation’s commitment to open progress and rigorous evaluation positions it to play a vital role in shaping the future of AI.