François Chollet: ARC-3 and the Path to AGI
By Y Combinator
Key Concepts
- AGI (Artificial General Intelligence): Intelligence that can understand, learn, adapt, and apply knowledge across a wide range of environments, much as a human does.
- Scaling Laws: The predictable improvement in AI benchmark results as model size and training data size increase with the same architecture and training process.
- Memorized Skills vs. Fluid Intelligence: Static, task-specific knowledge versus the ability to understand and solve new problems on the fly.
- Test-Time Adaptation (TTA): The ability of a model to dynamically modify its behavior based on the specific data it encounters during inference.
- Static Skills vs. Fluid Intelligence: The difference between having access to a collection of static programs to solve known problems versus being able to synthesize brand new programs on the fly to face a problem never seen before.
- Operational Area: The scope of situations in which a skill can be applied effectively.
- Information Efficiency: How much data or practice is needed to acquire a skill. Higher efficiency indicates higher intelligence.
- Shortcut Rule: When focusing on a single measure of success, you may succeed but at the expense of everything else that was not captured by your measure.
- Abstraction and Reasoning Corpus (ARC): An AI benchmark designed to measure fluid intelligence, requiring models to solve novel problems using core knowledge priors.
- Compositional Generalization: The ability to combine known concepts in novel ways to solve new problems.
- Type 1 Abstraction (Value-Centric): Abstraction based on continuous distance functions, underlying perception, pattern recognition, and intuition.
- Type 2 Abstraction (Program-Centric): Abstraction based on discrete program structures and isomorphisms, underlying reasoning and planning.
- Discrete Program Search: Combinatorial search over graphs of operators from a defined language (DSL) to find solutions.
- Kaleidoscope Hypothesis: The idea that the world's complexity arises from recombinations of a small number of fundamental "atoms of meaning."
Why Pre-Training Scaling Didn't Achieve AGI
- Confusion About Benchmarks: The field was misled by benchmarks that measured memorized skills rather than fluid intelligence.
- Static Inference: Pre-trained models were static at test time, unable to adapt to new situations.
- Lack of Fluid Intelligence: Scaling up pre-training did not lead to genuine fluid intelligence, as demonstrated by poor performance on the ARC benchmark.
- Definition of Intelligence: AI has chased task-specific skill because that was the field's working definition of intelligence, but this definition leads only to automation, not general intelligence.
Test-Time Adaptation and Its Impact
- Shift in AI Research: The AI community pivoted to test-time adaptation, enabling models to learn and adapt at inference time.
- Progress on ARC: TTA led to significant progress on the ARC benchmark, indicating genuine signs of fluid intelligence.
- Dynamic Behavior Modification: TTA allows models to modify their behavior dynamically based on encountered data, using techniques like test-time training, program synthesis, and chain-of-thought reasoning.
- Era of Test-Time Adaptation: AI is now fully in the era of test-time adaptation, moving beyond the pre-training scaling paradigm.
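The idea above can be made concrete with a toy sketch, assuming a simplified task format: at inference time the system is handed a few demonstration (input, output) pairs and fits a per-task hypothesis (here, a cell-wise color remapping) before predicting on the test input. All names and the hypothesis class are illustrative, not a real TTA implementation.

```python
def fit_color_map(demos):
    """Learn a cell-wise color mapping consistent with every demo pair."""
    mapping = {}
    for grid_in, grid_out in demos:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    return None  # demos contradict this hypothesis class
    return mapping

def adapt_and_predict(demos, test_input):
    """Adapt to the task at test time, then apply the learned program."""
    mapping = fit_color_map(demos)
    if mapping is None:
        return None
    return [[mapping.get(c, c) for c in row] for row in test_input]

demos = [([[1, 2], [2, 1]], [[3, 4], [4, 3]])]
print(adapt_and_predict(demos, [[2, 2], [1, 1]]))  # → [[4, 4], [3, 3]]
```

The key point is that the behavior applied to the test input did not exist before inference began; it was synthesized from the demonstrations on the fly.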
Defining and Measuring Intelligence
- Two Views of AI:
- Minsky View: AI is about performing tasks normally done by humans.
- McCarthy View: AI is about handling problems it hasn't been prepared for.
- Intelligence as a Process: Intelligence is the conversion ratio between information and operational area, an efficiency ratio.
- Problems with Exams: Human exams measure task-specific skill and knowledge rather than intelligence itself; they are valid only on the assumption that test-takers haven't memorized the questions beforehand.
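The efficiency-ratio framing above can be sketched numerically. This is illustrative only: the units and the helper below are hypothetical simplifications, not the formal measure from Chollet's "On the Measure of Intelligence".

```python
def skill_efficiency(operational_area, priors_bits, experience_bits):
    """Operational area achieved per unit of information consumed."""
    total_information = priors_bits + experience_bits
    if total_information <= 0:
        raise ValueError("information consumed must be positive")
    return operational_area / total_information

# Two systems reaching the same skill scope: the one that needed less
# data and practice scores as more intelligent under this framing.
sample_efficient = skill_efficiency(100.0, priors_bits=10, experience_bits=40)
data_hungry = skill_efficiency(100.0, priors_bits=10, experience_bits=990)
print(sample_efficient > data_hungry)  # → True
```

This is why benchmark scores alone are ambiguous: a high score reached through massive training data and a high score reached from a handful of examples imply very different levels of intelligence.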
The Abstraction and Reasoning Corpus (ARC)
- ARC1: Released in 2019, it's an IQ test for machines and humans, containing 1,000 unique tasks.
- Core Knowledge Priors: ARC tasks are built on objectness, elementary physics, basic geometry, topology, and counting.
- Resisting Pre-Training: ARC1 resisted the pre-training scaling paradigm, with performance staying near zero despite massive scale-up.
- ARC as a Directional Tool: ARC is not the destination but an arrow pointing towards unsolved bottlenecks on the way to AGI.
- ARC2: Released in March 2025, it challenges reasoning systems and focuses on compositional generalization.
- Human Feasibility: ARC2 tasks are fully solvable by ordinary people with no prior training.
- AI Performance: Base models like GPT-4.5 and Llama 4 score 0% on ARC2. Static reasoning systems score 1-2%.
- ARC3 (Future): A significant departure from the input-output pair format of ARC1 and ARC2. It will assess agency: the ability to explore, learn interactively, set goals, and achieve them autonomously.
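A sketch of what an ARC1/ARC2-style task looks like, assuming the published JSON layout: a few demonstration (input, output) grids plus test inputs, where grids are 2D arrays of color indices 0-9. The `solves_task` helper is an illustrative name for the standard check that a candidate program reproduces every demonstration.

```python
# A tiny hand-made task: the transformation is a horizontal mirror.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 0], [0, 3]]}],
}

def solves_task(program, task):
    """True if `program` reproduces every demonstration output exactly."""
    return all(program(pair["input"]) == pair["output"]
               for pair in task["train"])

def mirror(grid):
    """Candidate program: mirror each row horizontally."""
    return [row[::-1] for row in grid]

print(solves_task(mirror, task))         # → True
print(mirror(task["test"][0]["input"]))  # → [[0, 3], [3, 0]]
```

Because each task uses a transformation never seen before, memorized solutions don't transfer; the solver must induce the rule from two or three examples.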
The Kaleidoscope Hypothesis and Abstraction
- The Kaleidoscope Hypothesis: The world's complexity arises from recombinations of a small number of fundamental "atoms of meaning."
- Intelligence as Abstraction: Intelligence is the ability to mine experience to identify reusable abstractions and recombine them to create models adapted to new situations.
- Abstraction Acquisition and Recombination: Implementing intelligence involves efficiently extracting abstractions and recombining them on the fly.
- Two Kinds of Abstraction:
- Type 1 (Value-Centric): Based on continuous distance functions, underlying perception, pattern recognition, and intuition. Transformers are great at this.
- Type 2 (Program-Centric): Based on discrete program structures and isomorphisms, underlying reasoning and planning.
- Discrete Program Search: Combinatorial search over graphs of operators from a defined language (DSL) to find solutions.
- Combining Type 1 and Type 2: Human intelligence excels at combining perception/intuition (Type 1) with explicit reasoning (Type 2).
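Discrete program search, as described above, can be sketched as brute-force enumeration of operator compositions from a tiny hypothetical DSL until one maps the input grid to the target output. Real systems prune this combinatorial space with learned (Type 1) intuition instead of exhausting it.

```python
from itertools import product

# A toy DSL of grid operators; real DSLs are far larger.
DSL = {
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def search_program(grid_in, grid_out, max_depth=3):
    """Return the first shortest operator sequence mapping input to output."""
    for depth in range(1, max_depth + 1):
        for names in product(DSL, repeat=depth):  # all compositions at this depth
            g = grid_in
            for name in names:
                g = DSL[name](g)
            if g == grid_out:
                return list(names)
    return None

# A 90° clockwise rotation is found as a composition of two primitives.
print(search_program([[1, 2], [3, 4]], [[3, 1], [4, 2]]))  # → ['flip_v', 'transpose']
```

The cost of this search grows exponentially with depth, which is exactly why guiding it with deep-learning-based intuition about program space matters.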
The Future of AI: A Programmer-Like Meta-Learner
- AI as Programmer: AI will move towards systems that approach new tasks by writing software for them.
- Hybrid Models: Models will blend deep learning submodules (Type 1) and algorithmic modules (Type 2).
- Deep Learning-Guided Program Search: A discrete program search system guided by deep learning-based intuition about program space.
- Abstraction Library: A global library of reusable building blocks that evolves as it learns from incoming tasks.
- Ndea's Approach: Leveraging deep learning-guided program search to build a programmer-like meta-learner for scientific discovery.
- First Milestone: Solve ARC-AGI with a system that starts out knowing nothing at all about ARC-AGI.
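The evolving abstraction library described above can be sketched as follows: each solved task can contribute its program back to a global library of reusable building blocks, so later searches start from richer primitives. The class and method names here are illustrative, not any particular system's API.

```python
class AbstractionLibrary:
    """A growing collection of named grid transformations."""

    def __init__(self, primitives):
        self.blocks = dict(primitives)  # name -> grid transformation

    def register(self, name, fns):
        """Fuse an operator sequence into one reusable abstraction."""
        def composed(grid, _fns=tuple(fns)):
            for fn in _fns:
                grid = fn(grid)
            return grid
        self.blocks[name] = composed

lib = AbstractionLibrary({
    "flip_h": lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(col) for col in zip(*g)],
})
# A solved task revealed a useful composite; fold it back into the library
# so future searches can reach a 90° rotation in a single step.
lib.register("rotate_cw", [lib.blocks["transpose"], lib.blocks["flip_h"]])
print(lib.blocks["rotate_cw"]([[1, 2], [3, 4]]))  # → [[3, 1], [4, 2]]
```

Growing the library amortizes search cost across tasks: abstractions mined from past experience shorten the programs needed for future problems, which is the meta-learning loop in a nutshell.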
Synthesis/Conclusion
The pursuit of AGI requires a shift from simply scaling up pre-training to developing systems capable of fluid intelligence through test-time adaptation and a deeper understanding of intelligence itself. Intelligence is not just about skill, but about the efficiency with which skills are acquired and deployed. The ARC benchmark highlights the need for models that can reason and adapt to novel situations, and future progress depends on combining Type 1 and Type 2 abstractions through deep learning-guided program search. The ultimate goal is to create AI that can independently invent and discover, accelerating scientific progress.