I Forced Claude to Code for 24 Hours NONSTOP, Here's What Happened

By Cole Medin

AI Agent Development, Autonomous Coding, Large Language Model Applications, Software Development Tools

Key Concepts

Anthropic Harness: An open-sourced coordination layer for long-running coding agents, designed to manage tasks over extended periods without overwhelming context windows by splitting work across multiple agents and contexts.
Long-Running Agents: AI agents capable of executing tasks for hours or days, envisioned as background processes for coding assistants, proof-of-concept generation, and application development.
Context Window: The limited amount of information an AI model can process at any given time. The harness addresses this by segmenting tasks.
Test-Driven Development (TDD): A methodology where success criteria (tests) are defined upfront before coding begins, ensuring continuous validation of the AI's work.
App Spec (PRD): A text file defining the scope of work and MVP requirements for an application, serving as the primary context for the initializer agent.
Initializer Agent: The first agent in the harness, responsible for setting up the project, including creating a feature list, an initialization script, and a Git repository.
Feature List JSON: A file containing hundreds of granular test cases (e.g., 200+) that must pass for the application to be considered complete. This number is configurable.
Initialization Script: A script generated by the initializer agent to spin up the website and set up the project scaffolding.
Git Repository: Essential for version control and maintaining a save state after each agent session.
Claude Progress File: A file updated at the end of each session, summarizing the work done, enabling communication between agents across different context windows.
Coding Agent: Agents that execute the core development tasks, picking up from where the previous agent left off, implementing features, and performing regression testing.
Priming: The process where a coding agent gets up to speed by understanding the existing codebase, feature list, and previous progress.
Regression Testing: Verifying that newly implemented features do not break existing functionality.
Claude Agent SDK: A programmatic interface (Python/TypeScript) for interacting with Claude models, offering greater flexibility than CLIs for building custom systems like the harness.
Puppeteer MCP Server: A tool integrated with the coding assistant that allows for visual validation of the application in a browser, including taking screenshots.
Sandbox Environment: A secure environment where coding agents operate, with restricted file operations to a defined project directory.
Hooks: Mechanisms within the SDK to manage and validate specific types of commands, such as bash commands, to prevent malicious or unintended actions.
Model: The specific AI model used (e.g., Claude Opus 4.5).
Token Count: The number of tokens (small units of text) the model reads and generates, which serves as a rough measure of usage and cost.

Harness Architecture and Workflow

The Anthropic harness functions as a sophisticated coordination layer for AI coding agents, enabling them to tackle large, long-running projects by breaking them down into manageable segments. This approach circumvents the limitations of context windows by creating a structured workflow that allows agents to pick up tasks, implement features, and validate their work iteratively.

1. Project Initialization

The process begins with an App Spec (or Product Requirements Document - PRD), which outlines the desired application features and scope. This PRD serves as the primary input for the Initializer Agent.

Initializer Agent's Role:
- Reads the App Spec to understand the project requirements.
- Generates a comprehensive Feature List JSON file. This file contains a large number of granular test cases (e.g., 200+) that define the completion criteria for the application. The number of test cases is configurable.
- Creates an Initialization Script to set up the project scaffolding and spin up the application's web server.
- Initializes a Git repository to manage version control and save states.
- Creates a Claude Progress file, which will be updated at the end of each session to summarize the work done.
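
As a rough illustration of the feature list the initializer produces, the sketch below writes a tiny one to disk with every entry initially failing. Only the true/false "passes" flag comes from the summary above; the other field names (description, steps), the file name feature_list.json, and the helpers write_feature_list and load_feature_list are assumptions made for this example, not the harness's confirmed format.

```python
import json
import tempfile
from pathlib import Path


def write_feature_list(project_dir: Path, descriptions: list) -> Path:
    """Write one entry per feature, all initially failing (passes = False)."""
    features = [
        # "description" and "steps" are hypothetical field names.
        {"description": text, "steps": [], "passes": False}
        for text in descriptions
    ]
    path = project_dir / "feature_list.json"  # assumed file name
    path.write_text(json.dumps(features, indent=2), encoding="utf-8")
    return path


def load_feature_list(path: Path) -> list:
    """Read the feature list back, as a later coding agent would."""
    return json.loads(path.read_text(encoding="utf-8"))


# Demo in a throwaway directory with two made-up features.
demo_dir = Path(tempfile.mkdtemp())
demo_path = write_feature_list(
    demo_dir,
    ["User can start a new chat", "Sidebar lists past conversations"],
)
demo_features = load_feature_list(demo_path)
```

Starting every entry at false is what makes the file usable as a test-driven checklist: a feature only flips to true after an agent has implemented and validated it.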

2. Coding Agent Loop

Once the initializer agent completes its setup, the Coding Agents take over in a continuous loop. Each coding agent operates within a fresh context window, ensuring it doesn't get overwhelmed.

Coding Agent Workflow:
1. Priming: The agent first "primes" itself by:
  - Reading the Claude Progress file to understand what the previous agent accomplished.
  - Reviewing the App Spec and Feature List JSON to identify the next task.
  - Examining the Git history for context.
  - Spinning up the website using the Initialization Script.
2. Regression Testing: Before implementing a new feature, the agent performs regression tests to ensure that existing functionality remains intact. This is crucial for maintaining code stability as the project grows.
3. Feature Implementation: The agent selects the next feature in the Feature List JSON whose "passes" field is still false and implements it.
4. Validation: The agent uses tools like the Puppeteer MCP Server to perform visual validation of the implemented feature in a browser. This includes actions like clicking buttons and waiting for elements to load, and can even involve taking screenshots.
5. Update Progress:
  - If issues are found during regression testing or feature implementation, the agent marks the corresponding feature in the Feature List JSON as false and attempts to fix it.
  - Once a feature is successfully implemented and validated, the agent updates its status to true in the Feature List JSON.
  - The agent then commits the changes to the Git repository.
  - Finally, it updates the Claude Progress file with a summary of the work completed in that session.
6. Looping: This process repeats for a predetermined number of sessions or until all test cases in the Feature List JSON pass.
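
The loop above can be sketched as a small driver. This is a simplified sketch, not the harness's actual code: it handles one feature per session, replaces the real agent with a stand-in callable (run_session), and omits the Git commit. The file names feature_list.json and claude-progress.txt and the function names are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def next_failing(features: list) -> Optional[dict]:
    """Return the first feature whose 'passes' flag is still False."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def run_harness(project_dir: Path, run_session: Callable[[dict], bool],
                max_sessions: int) -> int:
    """Run sessions until all features pass or the budget is spent."""
    feature_path = project_dir / "feature_list.json"     # assumed name
    progress_path = project_dir / "claude-progress.txt"  # assumed name
    sessions = 0
    while sessions < max_sessions:
        # Re-read from disk each session: files are the only shared memory
        # between agents running in separate context windows.
        features = json.loads(feature_path.read_text(encoding="utf-8"))
        target = next_failing(features)
        if target is None:
            break  # every test case passes
        sessions += 1
        target["passes"] = bool(run_session(target))
        feature_path.write_text(json.dumps(features, indent=2), encoding="utf-8")
        # A real harness would also commit to Git at this point.
        with progress_path.open("a", encoding="utf-8") as log:
            status = "passed" if target["passes"] else "still failing"
            log.write(f"session {sessions}: {target['description']} {status}\n")
    return sessions


# Demo with a stand-in session that always succeeds.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "feature_list.json").write_text(json.dumps([
    {"description": "New chat button", "passes": False},
    {"description": "Theme toggle", "passes": False},
    {"description": "Token counter", "passes": False},
]), encoding="utf-8")
sessions_used = run_harness(demo_dir, lambda feature: True, max_sessions=10)
final_features = json.loads(
    (demo_dir / "feature_list.json").read_text(encoding="utf-8"))
```

Because the driver re-reads the feature list and appends to the progress file on every pass, a session that fails simply leaves its feature at false and the next session picks it up again.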

Technical Implementation and Tools

The harness leverages the Claude Agent SDK for programmatic interaction with Claude models, offering significant flexibility over command-line interfaces.

Claude Agent SDK:
- Allows developers to define agents in Python or TypeScript.
- Project Directory Restriction: Agents are confined to a specified project directory, enhancing security.
- Sandbox Environment: Provides a controlled execution environment.
- Permissions: Configurable permissions, including accepting all edits without human approval for autonomous operation.
- Allowed Tools: Explicitly defines the tools and commands Claude Code can execute (e.g., reading/writing files, browser automation).
- Hooks: Custom Python scripts can be used as hooks to manage and validate specific command types, such as bash commands, preventing actions like deleting directories or operating outside the current codebase.
- System Prompt Customization: Allows for tailoring the agent's behavior and defining its model, tools, and working directory.
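
To make the hook idea concrete, here is a minimal sketch of the kind of check a bash-command hook might perform. It is plain Python and deliberately does not use the Claude Agent SDK's real hook signature; the function validate_bash_command, the ALLOWED_COMMANDS set, and the blanket rejection of shell operators are all assumptions for illustration.

```python
import shlex
import tempfile
from pathlib import Path

# Illustrative allowlist; a real harness would choose its own set.
ALLOWED_COMMANDS = {"ls", "cat", "git", "npm", "node"}
# Conservative simplification: refuse chaining, piping, and substitution.
SHELL_OPERATORS = set(";&|`$<>")


def validate_bash_command(command: str, project_dir: Path) -> tuple:
    """Return (allowed, reason) for a proposed bash command."""
    if any(char in SHELL_OPERATORS for char in command):
        return False, "shell operators are not allowed"
    try:
        tokens = shlex.split(command)
    except ValueError as exc:
        return False, f"could not parse command: {exc}"
    if not tokens:
        return False, "empty command"
    program = Path(tokens[0]).name
    if program not in ALLOWED_COMMANDS:
        return False, f"'{program}' is not on the allowlist"
    root = project_dir.resolve()
    for token in tokens[1:]:
        if token.startswith("-"):
            continue  # skip option flags
        # Treat every remaining argument as a possible path and make sure
        # it resolves to the project directory or something inside it.
        candidate = (root / token).resolve()
        if candidate != root and root not in candidate.parents:
            return False, f"'{token}' points outside the project directory"
    return True, "ok"


# Demo against a throwaway project directory.
project = Path(tempfile.mkdtemp())
ok_status = validate_bash_command("git status", project)
blocked_rm = validate_bash_command("rm -rf .", project)
blocked_escape = validate_bash_command("cat ../secrets.txt", project)
blocked_chain = validate_bash_command("ls && rm -rf .", project)
```

Rejecting every shell operator is stricter than a production hook would likely be (it also blocks harmless pipes), but it keeps the sketch short and errs on the side of refusing.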
Puppeteer MCP Server:
- Facilitates browser automation and visual validation.
- Spinning up a browser, waiting for elements, and interacting with them is time-consuming. That slows the rate at which tokens are consumed, which helps keep usage manageable over long runs.
Model Choice:
- The demonstration uses Claude Opus 4.5, a powerful model suitable for complex coding tasks. The choice of model can significantly impact performance and cost.

Experiment: Building a Claude.ai Clone

The video details an experiment where the Anthropic harness, integrated with Claude Code and the Claude Agent SDK, is tasked with building a functional clone of Claude.ai over 24 hours.

Objective: To assess the capabilities of long-running agents and the effectiveness of the Anthropic harness in building a complex application.
Application: A user interface for interacting with Claude, similar to Claude Desktop, with features like file uploads, project management, and conversation history.
Configuration:
- The experiment uses the Claude Agent SDK in Python.
- Instead of an Anthropic API key, the presenter uses a Claude subscription token (obtained via claude setup-token) to avoid high costs associated with prolonged runs.
- The App Spec provided with the harness repository is used, which is highly detailed to generate the extensive feature list.
- The model used is Claude Opus 4.5.

Results After 24 Hours:

Session Count: The agent reached the 54th coding agent session.
Test Passing Rate: 54% of the tests were passing, meaning over 100 features were successfully implemented.
Application Functionality: The resulting application was a "completely functional clone of Claude.ai" with numerous features, including:
- Past conversation history with markdown formatting.
- Ability to create HTML pages.
- Functionality to write and execute code.
- Settings for theme changes, default model selection, and max token sliders.
- Display of token counts for responses and user prompts.
UI Imperfections: While impressive, the UI was not perfect, highlighting the need for human oversight ("human in the loop").
Alignment and Hallucination: Despite the long duration and numerous sessions, the agent remained largely aligned with the project goals, with minimal hallucination or deviation from the Claude Progress file. This is attributed to the harness's structured approach and the capabilities of Claude Opus 4.5.
Feature List Progress: The Feature List JSON showed a significant number of features marked as true for passes, with the remaining false entries focusing on finer details like scroll bars, mobile styling, and dividers.
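
A pass rate like the one reported can be computed directly from the feature list. The sketch below assumes each entry has a boolean "passes" field and a "description" field (the latter is a hypothetical name). The figures in the demo, 108 passing out of 200, are illustrative numbers chosen to match the reported 54%; the video states only "200+" tests and the percentage.

```python
def summarize(features: list) -> tuple:
    """Return (passing, total, percent_passing, remaining_descriptions)."""
    total = len(features)
    passing = sum(1 for feature in features if feature["passes"])
    percent = 100 * passing / total if total else 0.0
    remaining = [f["description"] for f in features if not f["passes"]]
    return passing, total, percent, remaining


# Illustrative data: 200 features, the first 108 marked as passing.
demo_features = [
    {"description": f"feature {i}", "passes": i < 108}
    for i in range(200)
]
passing, total, percent, remaining = summarize(demo_features)
print(f"{passing}/{total} passing ({percent:.0f}%), {len(remaining)} remaining")
```

Listing the remaining descriptions alongside the percentage shows at a glance whether the unfinished work is core functionality or finer detail.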

Key Arguments and Perspectives

Legitimacy of Strategies: The presenter emphasizes that the strategies outlined in the Anthropic article and implemented in the harness are "legit" and effective for building complex AI-driven applications.
Future of Coding Assistants: Long-running agents are predicted to become a common tool for initiating coding projects, generating proofs of concept, and acting as background tasks for developers.
Power of Programmatic Control: Using SDKs like the Claude Agent SDK provides greater control and flexibility for building custom AI coding systems compared to CLIs.
Value of Open Source: The open-sourcing of the Anthropic harness is seen as a significant contribution, enabling experimentation and adoption by the community.
Importance of TDD for AI: Test-driven development is highlighted as a powerful methodology for AI coding, ensuring continuous validation and progress.
Iterative Development: The harness's iterative nature, with agents building upon previous work and performing regression testing, is key to its success in tackling complex projects.

Notable Quotes

"All a harness is is a coordination layer on top of coding agents that allows them to work for hours and hours on a task without overwhelming their context window."
"I really do think that in the near future, long-running agents are going to be something we use a lot to kick off our coding assistants as background tasks to start an application for us, like build out a proof of concept, and then we come in and keep building on top of it."
"The strategies here are legit."
"The real value for you is understanding how this harness works even so you can take ideas from this to evolve your own system for AI coding."
"Git is absolutely crucial for any AI coding system."
"We have true power and flexibility when we interact with Claude Code directly in our Python or TypeScript code."
"I really do think that this is also the direction that we're heading with coding assistants because it's really easy to build our own systems like this harness when we control things programmatically."
"I really appreciate that regression testing is built into this."
"It's really cool how much I was able to build here without laying a finger on anything."
"I got to hand it to Anthropic here. Overall, I'm very impressed."

Conclusion and Takeaways

The Anthropic harness represents a significant advancement in enabling AI agents to undertake complex, long-running coding projects. By employing a structured approach based on test-driven development, clear artifact management (App Spec, Feature List, Claude Progress), and iterative execution, the harness effectively overcomes context window limitations. The experiment demonstrated the potential of this system: within 24 hours it produced a functional application clone with 54% of its tests passing. The flexibility offered by the Claude Agent SDK and tools like Puppeteer further enhances the capabilities of these autonomous coding systems. The presenter encourages developers to explore the open-sourced harness and adapt its principles to build their own AI coding workflows. The "Dynamous Agentic Coding course" is recommended as a resource for learning more about building reliable and repeatable AI coding systems.
