The only AutoResearch tutorial you’ll ever need

Key Concepts

Auto Research: An open-source framework by Andrej Karpathy that enables AI agents to perform recursive self-improvement by running autonomous experiment loops.
Recursive Self-Improvement: The process where an AI agent iteratively modifies its own code or parameters to optimize a specific metric without human intervention.
The Three-File Architecture: The core structure of an Auto Research loop, consisting of program.md (goals/constraints), train.py (the modifiable code), and prepare.py (the immutable evaluation script).
Time-Boxed Evaluation: A methodology where experiments are allocated a fixed time budget to ensure results are comparable and to prevent the agent from "cheating" by simply training longer.
Git-Based Version Control: Using git commit to save successful experiments and git reset to discard unsuccessful ones.
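
As a rough illustration of time-boxed evaluation, the sketch below runs a training step repeatedly until a fixed budget is spent. The names run_time_boxed, step_fn, and clock are hypothetical and introduced only for this example; they are not taken from the framework.

```python
import time


def run_time_boxed(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn(step_index) repeatedly until the time budget is spent.

    Hypothetical helper: every experiment gets the same wall-clock budget,
    so a candidate cannot win simply by training longer.
    """
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:
        step_fn(steps)  # one unit of work, e.g. a single training step
        steps += 1
    return steps
```

With a budget of 300 seconds this would correspond to the 5-minute example used later in these notes. The clock parameter exists only so the function can be exercised with a fake clock instead of waiting in real time.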

1. Overview of Auto Research

Auto Research is an approach to AI development in which an agent is tasked with optimizing a specific metric by running hundreds or thousands of experiments autonomously. Instead of a human manually tuning parameters, the AI agent proposes a hypothesis, modifies the code, evaluates the result, and decides whether to keep or discard the change based on a predefined scoring function.

2. The Three-File Architecture

To build an auto research loop, three specific files are required:

program.md: Contains the human-defined goals, constraints, and rules. It acts as the "constitution" for the agent.
train.py: The only file the agent is permitted to modify. This can contain code, configuration settings, prompts, or mathematical equations.
prepare.py: The immutable evaluation script. The agent cannot touch this file, ensuring it cannot "cheat" by rewriting the scoring function to fake better results.
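
One possible way to enforce the "agent cannot touch prepare.py" rule is to pin a hash of the file before the loop starts and check it before every evaluation. This is an assumption about how the rule could be enforced, not something described in the source. The helper names file_digest and evaluator_unchanged are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def evaluator_unchanged(path, expected_digest):
    """True if the evaluation script still matches the digest pinned earlier."""
    return file_digest(path) == expected_digest
```

A loop could record file_digest("prepare.py") once at startup and refuse to score any experiment for which evaluator_unchanged returns False.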

3. The Experimental Loop Process

The workflow follows a strict, repeatable cycle:

Hypothesis Generation: The agent proposes a change to train.py.
Execution: The code is trained or executed for a fixed time budget (e.g., 5 minutes).
Evaluation: The prepare.py script measures the performance against the target metric.
Decision:
- If the result is better, the agent performs a git commit to save the progress.
- If the result is worse, the agent performs a git reset to revert to the previous state.
Iteration: The loop repeats indefinitely.
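
The cycle above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the framework's actual code: research_loop and its callable parameters (propose, evaluate, keep, discard) are hypothetical, the loop is bounded by an iterations argument rather than running indefinitely, and a tie with the current best is treated as "not better".

```python
def research_loop(propose, evaluate, keep, discard, baseline, iterations,
                  higher_is_better=True):
    """Run a keep-or-discard experiment loop and return (best_score, history).

    Hypothetical callables supplied by the caller:
      propose(i)     - apply a candidate change (e.g. the agent edits train.py)
      evaluate()     - run the fixed evaluation and return a numeric score
      keep(i, score) - persist the change (e.g. by wrapping `git commit`)
      discard(i)     - revert the change (e.g. by wrapping `git reset`)
    """
    best = baseline
    history = []
    for i in range(iterations):
        propose(i)
        score = evaluate()
        # Assumption: only a strict improvement counts as "better".
        improved = score > best if higher_is_better else score < best
        if improved:
            keep(i, score)
            best = score
        else:
            discard(i)
        history.append((i, score, improved))
    return best, history
```

Passing the git operations in as callables keeps the decision logic testable without a repository. In a real setup, keep and discard would shell out to the git commands named in the Decision step.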

4. Key Arguments and Strategic Insights

The Value of Metrics: As execution becomes cheaper, the primary competitive advantage shifts to the ability to define the right metrics and constraints.
Beyond Machine Learning: While originally designed for optimizing AI models, the framework is applicable to any domain with an objective, measurable outcome, such as marketing (A/B testing), trading strategies (Sharpe ratio optimization), and software performance.
The "Final Boss" of AI: Andrej Karpathy suggests that this recursive self-improvement is the ultimate goal for frontier AI labs (OpenAI, Anthropic, etc.), moving beyond simple chatbots toward autonomous agents that perform meaningful work.

5. Practical Applications

Trading: Optimizing buy/sell rules based on market data to maximize risk-adjusted returns.
Marketing: Automating thousands of A/B tests on email copy, ad creatives, and landing pages daily.
Software Development: Automatically refactoring codebases to improve execution speed or reduce latency.
Prompt Engineering: Refining system instructions and context to improve the performance of other AI agents.
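
For the trading case, "risk-adjusted returns" needs to be reduced to a single number the loop can compare. A common choice is the Sharpe ratio, sketched below. The function name and its parameters are illustrative assumptions, and risk_free_rate is assumed to be expressed per period, matching the returns.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=1):
    """Mean excess return divided by its sample standard deviation.

    Assumptions: `returns` holds per-period returns (at least two values),
    `risk_free_rate` is per period, and `periods_per_year` scales the result
    to an annual figure (1 leaves it unscaled).
    """
    excess = [r - risk_free_rate for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)
```

A score like this is higher-is-better, so a candidate set of buy/sell rules would be kept only when it raises the ratio on the same fixed market data.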

6. Limitations and Failure Points

Subjectivity: Auto Research fails in areas where "better" is subjective (e.g., brand design, UX, or aesthetic choices).
Metric Misalignment: If the metric is poorly defined, the agent will confidently optimize for the wrong outcome.
Loop Speed: The evaluation must be fast and automated. If a human is required to judge the output, the loop breaks, and it ceases to be "auto" research.
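
Metric misalignment is easiest to see with a speed-only score: a program that returns wrong answers quickly would look like an improvement. The sketch below is one hypothetical way to guard against that by making correctness a hard gate. The function guarded_latency_score and its arguments are assumptions made for illustration.

```python
import math


def guarded_latency_score(outputs, expected, latency_seconds):
    """Lower is better. A candidate with any wrong output scores infinity,

    so it can never beat a correct candidate no matter how fast it runs.
    """
    if outputs != expected:
        return math.inf
    return latency_seconds
```

Used with a lower-is-better loop, this keeps the agent from "winning" by breaking the behavior the metric was supposed to protect.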

7. Synthesis and Conclusion

Auto Research represents the transition from AI as a tool to AI as an autonomous researcher. By leveraging a clear, objective metric and a rigid, time-boxed loop, developers can create systems that improve themselves while they sleep. The most successful future practitioners will be those who master the art of defining clear, measurable goals and constraints, allowing AI agents to handle the heavy lifting of iterative experimentation. As noted by the presenter, we may be in the early stages of the "singularity" where recursive self-improvement becomes the standard for all high-level technical and business tasks.