Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway

By AI Engineer

Key Concepts

  • Self-healing infrastructure: The concept of an application's infrastructure automatically detecting and fixing issues without human intervention.
  • Durable Execution/Workflows: A pattern for making complex logic simpler and more reliable. Each step is retried automatically on failure, and its result is cached once it succeeds.
  • AI Coding Agent: An artificial intelligence agent designed to monitor, analyze, and fix code and infrastructure issues, often integrated with LLMs and capable of performing actions like opening pull requests.
  • OpenCode: An open-source AI coding agent built for the terminal. It lets users choose their LLM provider and ships a headless server for programmatic interaction.
  • Railway: A deployment provider used in the demonstration, offering services for deploying applications and providing metrics and logs.
  • Metrics and Traces: Data collected from applications and infrastructure to monitor performance, identify errors, and understand behavior (e.g., CPU utilization, memory usage, request error rate, response time).
  • Pull Request (PR): A mechanism in version control systems (like Git) for proposing changes to a codebase, allowing for review and merging.

Application Infrastructure Self-Healing with AI

This demonstration outlines a system where an application's infrastructure can automatically detect and fix issues, moving beyond traditional alerting and manual investigation to a model where fixes are proposed via pull requests.

1. Problem Statement: Traditional Issue Resolution

  • Obvious Issues: Services with clearly visible problems, such as a memory leak (memory usage that climbs continuously), a high error rate (e.g., 94% of requests failing with 500 errors), or response times of multiple seconds.
  • Subtle Issues: Services that appear fine on basic metrics (CPU, memory) but have extremely high response times due to slow database queries, leading to poor user experience (e.g., 30-second page load times).
  • Current Approach: Setting thresholds for metrics (CPU, memory, error rate) triggers alerts. However, this still requires manual investigation by developers, involving digging through logs, metrics, and traces to diagnose and fix the problem.

2. Proposed Solution: AI-Driven Self-Healing

The core proposal is to implement a "coding agent" that actively monitors the application's infrastructure. When issues are detected (thresholds are met), the agent should automatically generate and ship a fix, ideally by opening a pull request for review.

3. Workflow for Automated Fixes

The proposed system involves a series of workflows to transition from issue detection to a pull request:

  • Scheduled Monitoring Workflow:

    • Frequency: Runs on a schedule (e.g., every 10, 15, or 30 minutes).
    • Steps:
      1. Fetch Application Architecture: Understand the deployed services (frontends, backends, crons, queues).
      2. Fetch Resource Metrics: Collect CPU and memory utilization for each service.
      3. Fetch HTTP Metrics: Gather request error rates (4xx, 5xx), failed requests, and response times.
      4. Identify Exceeded Thresholds: Determine which services have surpassed predefined thresholds.
      5. Return Affected Services: Generate a list of services requiring attention.
    • Rationale for Scheduled vs. Alert-Based: Analyzing a slice of time is preferred over immediate alert-based triggers to avoid noise from spiky workloads that might temporarily exceed thresholds but not indicate a persistent issue.
  • Contextual Data Gathering:

    • Once affected services are identified, more context is pulled:
      • Project Health: Overall status of all services.
      • Service-Specific Context: If a service is flagged, gather additional details.
      • Log Analysis: Check logs for errors to tell apart a genuine fault from high resource utilization that is legitimate (e.g., caused by heavy usage).
      • Code Scanning & Dependency Analysis: Scan the code to infer upstream providers, then check their status pages (e.g., for a payment processor outage). If an upstream provider is down, the agent can recommend waiting rather than changing code.
  • Detailed Plan Generation:

    • Synthesize all gathered information (e.g., high 500 requests, high memory utilization, specific endpoint errors) into a detailed plan. This plan outlines the application architecture, affected services, and potential root causes.
  • AI Agent Execution:

    • The detailed plan is handed to an AI agent.
    • Agent Actions:
      1. Clone the repository.
      2. Create a to-do list based on the plan.
      3. Implement the necessary fixes.
      4. Create a pull request.
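
The threshold step of the scheduled monitoring workflow can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the threshold values, the `min_fraction` rule, and all function and field names are assumptions made for the example. It shows why evaluating a slice of time filters out a single spike while still catching a sustained problem.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical limits; the talk does not state the actual values used.
    cpu_pct: float = 80.0
    memory_pct: float = 85.0
    error_rate_pct: float = 5.0
    p95_latency_ms: float = 2000.0


def sustained(samples: list[float], limit: float, min_fraction: float = 0.6) -> bool:
    """True when at least `min_fraction` of the window's samples exceed `limit`."""
    if not samples:
        return False
    over = sum(1 for s in samples if s > limit)
    return over / len(samples) >= min_fraction


def find_affected_services(
    window: dict[str, dict[str, list[float]]], limits: Thresholds
) -> list[dict]:
    """Return one entry per service that stays above a limit for most of the window."""
    affected = []
    for service, metrics in window.items():
        breaches = [
            name
            for name, limit in vars(limits).items()
            if sustained(metrics.get(name, []), limit)
        ]
        if breaches:
            affected.append({"service": service, "breaches": breaches})
    return affected


# Made-up samples for one monitoring window.
window = {
    "api": {"cpu_pct": [35, 40, 38, 36, 41], "error_rate_pct": [92, 95, 94, 96, 93]},
    "worker": {"cpu_pct": [30, 97, 31, 29, 33]},  # a single spike, not a trend
}
affected = find_affected_services(window, Thresholds())
```

Here `api` is flagged for its error rate across the whole window, while `worker` is ignored because only one of its five CPU samples crosses the limit.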

4. Core Technologies and Methodologies

  • Durable Execution (Workflows):

    • Concept: An abstraction that simplifies complex logic and enhances reliability.
    • Features:
      • Automatic Retries: Steps prone to failure are automatically retried by default.
      • Customizable Retry Logic: Options for exponential backoff or defining specific failure actions.
      • Result Caching: Successful steps' results are cached, preventing redundant work upon retries, leading to faster execution and cost savings.
    • Application: Used to orchestrate API calls to Railway for fetching architecture, metrics, and logs, and to trigger the fix generation process.
  • AI Coding Agent (OpenCode):

    • Description: An AI agent built for the terminal, serving as an open-source alternative to tools like Claude Code.
    • Key Features:
      • LLM Agnostic: Allows users to choose any LLM provider or model.
      • Terminal UI: Provides a user interface within the terminal.
      • Headless Server Implementation: Exposes an API for programmatic interaction, enabling deployment on servers (like Railway).
    • Architecture: Running OpenCode starts both a terminal UI (the client) and a server. The server can also run on its own on a remote machine, which lets custom clients talk to it.
    • Deployment on Railway:
      • A Dockerfile is used to define the environment.
      • Installs essential tools: curl, jq, bash, git, GitHub CLI.
      • Installs OpenCode.
      • Configures Git.
      • Authenticates the GitHub CLI for opening pull requests.
      • Exposes the necessary port.
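
The durable-execution features listed above (automatic retries, exponential backoff, and result caching) can be illustrated with a toy runner. This is a sketch under stated assumptions, not the workflow engine used in the demo: the `DurableRun` class, its JSON state file, and its parameters are all hypothetical, and step results are assumed to be JSON-serializable.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


class DurableRun:
    """Toy durable-execution runner (hypothetical, for illustration only).

    Each named step is retried on failure, and its result is persisted so that
    re-running the workflow skips steps that already succeeded.
    """

    def __init__(self, state_file: Path, max_attempts: int = 3, base_delay: float = 0.5):
        self.state_file = state_file
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.results: dict[str, Any] = (
            json.loads(state_file.read_text()) if state_file.exists() else {}
        )

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.results:  # cached by an earlier run or attempt
            return self.results[name]
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = fn()
                break
            except Exception:
                if attempt == self.max_attempts:
                    raise  # retries exhausted: surface the failure
                # Exponential backoff: base, 2x base, 4x base, ...
                time.sleep(self.base_delay * 2 ** (attempt - 1))
        self.results[name] = result
        self.state_file.write_text(json.dumps(self.results))
        return result
```

A workflow would wrap each API call in its own step, for example `run.step("fetch_architecture", fetch_architecture)`, so that a failure in a later step does not repeat the calls that already completed.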

5. Demonstration and Practical Implementation

The demonstration showcases the end-to-end process using a project named "railway autofix."

  • Project Structure: Consists of an API directory and an OpenCode directory.
  • OpenCode Server: A single server run with Bun, exposing an API on port 4096.
  • API: Runs on localhost:3000, with a UI provided by Inngest for debugging workflows.

Workflow Execution Example:

  1. monitor_project_health Workflow:

    • Get Project Architecture: Fetches details about databases, services, their repositories, configurations, and volumes.
    • Fetch Resource Metrics (Parallel): Retrieves CPU and memory utilization for services and databases.
      • Example Metric: Database CPU utilization is about 0.9 vCPU, and memory usage is 31.96 GB of a 32 GB limit (roughly 99.9%), which indicates the database is nearly out of memory.
    • Fetch HTTP Metrics (Parallel): Collects error rates (400s, 500s) and latency for each service.
    • Summarize Metrics: Formats results for the coding agent.
    • pull_service_context Function: Receives all gathered metrics and architecture information.
      • Fetches HTTP logs, build logs, and deployment logs for affected services.
      • Provides an architecture_summary in a readable format.
      • Passes this comprehensive data to the generate_fix workflow.
  2. generate_fix Workflow:

    • AI Analysis: The collected data is analyzed by an LLM.
    • Plan Generation: The AI creates a plan, including debugging steps (e.g., "reproduce locally with the same load") and recommendations.
    • Session Creation: A session is created for the coding agent, potentially per repository.
    • Fix Implementation & Pull Request: The agent executes the plan, implements fixes, and opens a pull request.
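
The parallel metric fetches and the summarize step in monitor_project_health can be sketched as below. This is an assumption-laden illustration: the two fetch functions are stubs standing in for calls to the deployment provider's API, the field names are hypothetical, and the returned values are sample data (the memory and error-rate figures echo the examples in this summary, while the latency figure is invented).

```python
import asyncio


async def fetch_resource_metrics(service: str) -> dict:
    # Stub: a real version would call the provider's metrics API.
    await asyncio.sleep(0)
    return {"service": service, "cpu_vcpu": 0.9, "memory_gb": 31.96, "memory_limit_gb": 32.0}


async def fetch_http_metrics(service: str) -> dict:
    # Stub: sample values only; the latency figure is made up.
    await asyncio.sleep(0)
    return {"service": service, "error_rate_pct": 94.0, "p95_latency_ms": 4200.0}


async def collect(services: list[str]) -> list[dict]:
    """Fetch resource and HTTP metrics for every service concurrently."""
    resource, http = await asyncio.gather(
        asyncio.gather(*(fetch_resource_metrics(s) for s in services)),
        asyncio.gather(*(fetch_http_metrics(s) for s in services)),
    )
    # gather preserves input order, so the two lists line up by service.
    return [{**r, **h} for r, h in zip(resource, http)]


def summarize(rows: list[dict]) -> str:
    """Format the metrics as plain text that can be handed to the coding agent."""
    lines = []
    for m in rows:
        mem_pct = 100 * m["memory_gb"] / m["memory_limit_gb"]
        lines.append(
            f"- {m['service']}: cpu={m['cpu_vcpu']:.2f} vCPU, "
            f"memory={mem_pct:.1f}% of limit, "
            f"errors={m['error_rate_pct']:.0f}%, p95={m['p95_latency_ms']:.0f} ms"
        )
    return "\n".join(lines)


summary = summarize(asyncio.run(collect(["api", "postgres"])))
```

Keeping the summary as short plain text, rather than raw API responses, is one way to give the LLM the signal it needs without filling its context with unrelated fields.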

6. Outcome: Automated Pull Request

  • Pull Request Created: A pull request is automatically opened in the GitHub repository.
  • PR Conversation: Includes a summary of changes, analysis, root causes, and what was fixed.
  • Review and Merge: Developers can review the PR, and if satisfactory, merge it to deploy the fix.

7. Conclusion and Next Steps

The demonstration successfully illustrates a system where application infrastructure can self-heal by leveraging durable workflows for data collection and an AI coding agent for automated fix generation and pull request creation. The code for this project will be made available.

Contact: For questions, reach out on X (Twitter).
