Building Cursor Composer – Lee Robinson, Cursor
By AI Engineer
Key Concepts
- Cursor Composer: Cursor's first agent model, designed for real-world software engineering, balancing speed and intelligence.
- Agent Model: An AI model capable of performing tasks autonomously by interacting with tools.
- Token Generation Efficiency: A measure of how many tokens an AI model can generate per unit of computational resource or time.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by taking actions in an environment to maximize a reward.
- Tool Calling: The ability of an AI model to invoke external functions or services (tools) to perform specific actions.
- Parallel Tool Execution: The capability of an agent to call multiple tools simultaneously, rather than sequentially.
- Semantic Search: A search technique that understands the meaning and context of queries, going beyond keyword matching, often powered by embedding models.
- Mixture of Experts (MoE) Model: A type of neural network architecture that combines multiple "expert" sub-networks, each specializing in different aspects of the data.
- Custom Kernels: Specialized, optimized code segments designed for specific hardware or computational tasks, often used to accelerate machine learning operations.
- Low-Precision Training: Training machine learning models using numerical representations with fewer bits, which can speed up computation and reduce memory usage.
- Inference Server: A server dedicated to running trained AI models to generate predictions or perform tasks.
- Environment Servers: Servers that simulate the operational environment for an AI agent, allowing for training and testing.
- Rollouts: In RL, a sequence of actions taken by an agent in an environment, used to gather data for training.
- Load Balancing: Distributing computational workload across multiple threads or processes to optimize resource utilization and prevent bottlenecks.
- Cloud Agents Product: A feature allowing users to run Cursor agents remotely — in the background, on the web, or via integrations like Slack — typically on virtual machines.
- Semi-Async Valley of Death: A term describing the frustrating middle ground where AI agent performance is neither fast enough for immediate interaction nor powerful enough for significant background processing.
Cursor Composer: Building a Fast and Smart Software Engineering Agent
This presentation details the development of Cursor Composer, Cursor's inaugural agent model, emphasizing its design for real-world software engineering with a focus on both speed and intelligence.
Performance Benchmarks and Goals
- Performance: Composer is benchmarked against open-source models, outperforming the best of them. While slightly below the latest frontier models like Claude Sonnet 4.5 and GPT-5.1-Codex, it achieves approximately four times greater token-generation efficiency than models of similar intelligence.
- Core Objective: The primary goal was to mesh speed with intelligence, creating a model that is both fast and capable of complex reasoning for software development.
Motivation for Building Composer
- Evolution from Tab: Cursor's existing IDE already utilizes a model called "Tab" for autocomplete. The team aimed to apply a similar low-latency approach to coding with agents.
- User Feedback: Early prototypes, released under the slug "cheetah," were well-received for their speed, but users indicated they were "not really smart enough yet to be a daily driver." This feedback highlighted the critical need for enhanced intelligence.
- Internal Benchmark: To address the intelligence gap, Cursor developed an internal benchmark reflecting its own repository usage and software development practices. The aim was to produce a model checkpoint that developers would reach for daily.
- Key Enablers: Significant improvements were achieved through the ability to call tools in parallel and the effective utilization of a semantic search tool.
Composer in Action: A Demo
- Cursor 2.0: The presentation showcases Composer within Cursor 2.0, demonstrating its rapid execution.
- Parallel Tool Calls: Composer is shown making numerous tool calls concurrently, including reading files, executing shell commands, making file edits, and managing to-do lists.
- User Experience: The agent's speed keeps users "in the flow" by working through tasks quickly, in contrast with the longer wait times of traditional agents. This offers a "different programming experience."
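The latency win from parallel tool calls can be sketched with a toy example. The tool names below mirror those mentioned in the talk, but the functions are stand-ins, not Cursor's actual implementation: issuing independent calls concurrently instead of awaiting each one in turn cuts end-to-end wait time.

```python
# Hypothetical sketch: an agent issuing several tool calls concurrently.
import asyncio


async def read_file(path: str) -> str:
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"contents of {path}"


async def run_shell(cmd: str) -> str:
    await asyncio.sleep(0.01)  # simulate command execution
    return f"output of {cmd}"


async def main() -> list[str]:
    # Sequential execution would await each call in turn; gathering them
    # lets all tool calls run concurrently, so total latency is roughly
    # that of the slowest call rather than the sum of all of them.
    return await asyncio.gather(
        read_file("src/app.py"),
        read_file("src/utils.py"),
        run_shell("ls src"),
    )


results = asyncio.run(main())
print(results)
```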
Technical Approach and Challenges
- Agent Workflow: A user query is submitted to the backend. The agent interprets the query and decides on a series of tool calls.
- Available Tools: Composer has approximately 10 tools, with a focus on five key ones:
- Reading files
- Editing files
- Searching codebase
- Analyzing lints
- Running terminal/shell commands
- Autonomous Decision-Making: The agent autonomously determines whether to execute tools serially or in parallel.
- Reinforcement Learning (RL) for Training: The goal is to mirror the Cursor production environment as closely as possible during RL training. This involves:
- Rollouts: Running sequences of tool calls (e.g., reading and editing files, or codebase search) from the same starting point.
- Scoring and Updating: Evaluating the output of different rollouts, selecting the better ones, and updating the model's parameters accordingly.
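The rollout–score–update loop above can be sketched in miniature. Everything here is a placeholder — the "policy" is a weight table rather than an LLM, and the reward function simply favors traces that gather context before editing — but the shape of the loop (sample several rollouts from the same starting point, score them, reinforce the best) matches the description:

```python
# Toy sketch of the rollout-score-update loop; real RL training updates
# a model's parameters, not a lookup table of action weights.
import random

random.seed(0)

ACTIONS = ["read", "search", "edit"]


def rollout(policy: dict[str, float], length: int = 5) -> list[str]:
    # Sample a sequence of tool calls, weighted by current preferences.
    weights = [policy[a] for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights, k=length)


def score(trace: list[str]) -> float:
    # Stand-in reward: editing after reading/searching is good,
    # editing blind is penalized.
    reward, seen_context = 0.0, False
    for action in trace:
        if action in ("read", "search"):
            seen_context = True
        elif action == "edit":
            reward += 1.0 if seen_context else -1.0
    return reward


policy = {a: 1.0 for a in ACTIONS}
for step in range(50):
    # Several rollouts from the same starting point; reinforce the
    # actions taken in the best-scoring trace.
    traces = [rollout(policy) for _ in range(8)]
    best = max(traces, key=score)
    for action in best:
        policy[action] += 0.1

print(policy)
```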
Infrastructure and ML Challenges
The scaling of this RL approach presents three primary challenges, which are largely infrastructure-related:
- Matching Training and Inference Environments:
- Problem: Training a large Mixture of Experts (MoE) model on thousands of GPUs requires significant speed-up to be feasible. The training and sampling versions must be closely aligned.
- Solution: Focus on optimizing the speed of the training process to match inference.
- Complexity of Rollouts:
- Problem: Real-world rollouts can involve hundreds of thousands to millions of tokens and hundreds of tool calls, with varying completion times.
- Solution: Develop mechanisms to handle the diverse durations and complexities of individual rollouts.
- Consistency and Compute Spikiness:
- Problem: Mimicking the production environment requires exact tool format and response consistency. However, training involves "bursty" compute, unlike the more standard inference at production.
- Solution: Address the infrastructure challenges arising from the difference between bursty training compute and steady-state inference.
Infrastructure Architecture and Solutions
The architecture comprises three main server types:
- Inference Server: Runs trained models and executes rollouts using frameworks like Ray.
- ML Stack (PyTorch): The core machine learning framework.
- Environment Servers: Simulate the Cursor environment, allowing agents to interact with code and tools.
Key Infrastructure Solutions:
- Custom Kernels for Low-Precision Training:
- Benefit: Developed by the research team, these custom kernels significantly speed up the training process and simplify deployment to inference servers.
- Impact: Achieved approximately 3.5 times speed-up for MoE layers on Nvidia Blackwell chips.
- Load Balancing for Inference Servers:
- Problem: Rollouts complete at different times, leading to potential idle time.
- Solution: Implemented load balancing across threads and processes to shift work dynamically, preventing idle resources. This ensures that if one rollout is lengthy (e.g., installing packages), others are not held up.
- Co-design of Models and Products:
- Benefit: Having both the coding agent (IDE) and model research teams working together allows for co-design.
- Example: The development of RL work for Composer coincided with the building of the cloud agents product. This product spins up virtual machines (VMs) in the cloud to load user code, allowing agents to make file changes and run tools in a secure sandbox.
- Synergy: These cloud VMs serve as an ideal infrastructure for RL training, closely matching the production Cursor environment.
- VM Orchestration:
- Challenge: The spiky nature of training workloads required infrastructure to support and orchestrate hundreds of thousands of VMs across multiple clusters.
- Visualization: An internal dashboard was built using Composer to visualize the VM fleet.
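The load-balancing idea above can be illustrated with a shared worker pool: idle workers pick up new rollouts as soon as they finish, so one slow rollout (e.g. a long package install) never stalls the rest. The durations below are invented for illustration; Cursor's actual scheduler is not shown in the talk.

```python
# Illustrative sketch: a shared work queue keeps workers busy even when
# individual rollouts take very different amounts of time.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def run_rollout(rollout_id: int, duration: float) -> int:
    time.sleep(duration)  # simulate a rollout of varying length
    return rollout_id


# Rollout 0 is deliberately slow; the others are quick.
durations = [0.5] + [0.05] * 7

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_rollout, i, d) for i, d in enumerate(durations)]
    completed = [f.result() for f in as_completed(futures)]

# The fast rollouts all finish while the slow one is still running, so
# the slow rollout comes last in completion order despite being submitted
# first -- no worker sat idle waiting on it.
print(completed)
```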
The Power of Semantic Search
- Tool Integration: A key benefit of closely matching the production environment is the ability to provide the model with valuable, integrated tools.
- Cursor's Embedding Model: Cursor has trained its own embedding model for semantic search. This allows agents to use natural language queries to find relevant files within a codebase.
- Impact on Models: Research showed that semantic search benefited every model within the Cursor agent harness.
- Composer's Advantage: Composer, trained in the same environment, became a "power user" of semantic search, making it particularly effective.
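The retrieval step behind semantic search can be shown in miniature. A real system uses a trained embedding model over the codebase; here the vectors are hand-made toy values (auth-related files cluster along the first dimension) so the ranking logic is easy to follow:

```python
# Minimal illustration of semantic search: embed files and the query as
# vectors, rank files by cosine similarity to the query.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Pretend embeddings; a real embedding model produces these from file text.
file_embeddings = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "auth/session.py": [0.8, 0.2, 0.1],
    "billing/invoice.py": [0.1, 0.9, 0.2],
}

# Pretend embedding of a natural-language query like
# "where is user login handled?"
query_embedding = [0.85, 0.15, 0.05]

ranked = sorted(
    file_embeddings,
    key=lambda path: cosine(query_embedding, file_embeddings[path]),
    reverse=True,
)
print(ranked)  # auth files rank above the billing file
```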
Release and Future Directions
- Continuous Improvement: The RL training process demonstrated continuous model improvement with increased compute and rollouts.
- Performance Trajectory: Composer started at the performance level of the best open models and has progressed to near-frontier capabilities for coding agents.
- Scalability of RL: This success is seen as a positive indicator for scaling RL to other specialized tasks beyond coding.
- Behavioral Improvements: RL enabled Composer to:
- Increase Speed: Read 10 files in parallel, making the end-to-end experience faster.
- Improve Agent Behavior: Reduced unnecessary edits, learned to prioritize searching and reading files before making changes, becoming more effective overall.
- User Reception: Composer was released in Cursor 2.0 last month and has received positive feedback, with some users reporting it has "brought a lot of joy back to coding with agents."
- The "Airplane Wi-Fi" Analogy: The speaker likens the experience of early coding agents to frustratingly slow airplane Wi-Fi – functional but not ideal. Composer aims to bridge the gap, offering a more synchronous, in-the-loop experience akin to writing code by hand, avoiding the "semi-async valley of death."
- Hybrid Workflow: A common daily workflow involves using frontier models (like GPT-5.1-Codex) for planning and then Composer to execute those plans.
Reflections and Takeaways
- Effectiveness of RL: Reinforcement Learning can be highly effective for training specialized models with high-quality data and sufficient compute, particularly for coding tasks. Cursor's focus is on building "very good coding models," not general intelligence (AGI).
- Impact of AI Tools: Tools like Cursor significantly accelerate R&D by improving code writing and debugging efficiency, leading to faster iteration and product shipping.
- Infrastructure as ML: A significant portion of ML work and training processes are fundamentally infrastructure problems, highlighting the strong correlation between the two. This mirrors observations from the speaker's previous work at Vercel.
Hiring and Conclusion
Cursor is actively hiring across various roles, with a new office in New York, to continue building leading coding models.