AI Engineer Code Summit 2025: AIE/CODE Track
By AI Engineer
AI Engineer Code Summit 2025: Key Takeaways and Insights
This summary gives a detailed overview of the AI Engineer Code Summit 2025, covering the main topics, key points, methodologies, arguments, and technical details presented by the speakers.
Key Concepts:
- The "New Code": A paradigm shift in how humans interact with and create software, moving beyond traditional coding to a more intuitive, soul-driven, and collaborative process with AI.
- "No More Slop": A central theme emphasizing the need for quality, authenticity, and accuracy in AI-generated content and code, combating low-quality output.
- Agent Skills: A new paradigm for packaging procedural knowledge for agents, moving away from monolithic agents to composable, reusable "skills" (organized folders with scripts).
- Context Engineering: Advanced techniques for managing and optimizing the information provided to LLMs to improve their performance, especially in complex, brownfield codebases.
- Cursor Composer: A fast and intelligent coding model developed by Cursor, optimized for real-world software engineering tasks.
- Reinforcement Learning (RL) for Agents: Techniques for training agents to improve their performance through feedback and reward signals, with a focus on efficiency and scalability.
- Code World Models (CWM): Models that reason, plan, and make decisions by predicting program execution and program states, aiming to understand code beyond just syntax.
- Evaluation of AI Models: The critical importance of robust, dynamic, and contamination-resistant evaluation methodologies for assessing AI capabilities in coding and agentic tasks.
- Vibe Coding/Vibe Engineering: A more intuitive, less code-centric approach to interacting with AI for software development, emphasizing collaboration and leveraging AI for rapid prototyping and iteration.
- Agentic Development Platforms: New IDEs and platforms designed from the ground up to integrate AI agents seamlessly into the software development lifecycle.
- Poolside Models: Proprietary models built from scratch, focusing on reinforcement learning and aiming to close the gap between AI and human intelligence in knowledge work.
- Prompt Learning: A technique for improving AI agent performance by iterating on system prompts based on feedback and evaluations, offering an alternative to traditional RL.
- Agent Environments: Standardized, containerized systems for evaluating and training AI agents, crucial for creating robust and scalable AI development workflows.
- Antigravity IDE: Google DeepMind's first agent-first IDE, integrating an editor, browser, and agent manager to provide a comprehensive AI-powered development experience.
Main Topics and Key Points:
1. The Future of AI and Coding: The "New Code" and "No More Slop"
- Main Argument: The conference theme revolves around a paradigm shift in software development, moving towards a "New Code" where AI agents are integral collaborators. A strong emphasis is placed on combating "slop" – low-quality, inauthentic, or inaccurate AI-generated content.
- Key Points:
- The transition from instructing AI to collaborating with it, where the "interface bleeds with the soul."
- The need to elevate quality and "taste" in AI output, as highlighted by swyx's "war on slop."
- The idea that AI can be used to fight slop, through better prompting and curated outputs.
- The concept of "autonomy without accountability" being sloppy, emphasizing the need for responsible AI development.
- Supporting Evidence: swyx's keynote, which framed the conference's core mission and introduced the "no more slop" mantra.
2. Evolving Agent Architectures: From Agents to Skills
- Main Argument: The industry is moving from building monolithic, use-case-specific agents to a more modular approach using "skills" – composable collections of files and scripts that provide specialized procedural knowledge.
- Key Points:
- Agents were initially thought to be domain-specific, each requiring its own tools and scaffolding.
- Code is recognized as a universal interface, enabling a more general-purpose agent architecture.
- "Skills" are presented as organized folders with scripts as tools, making them accessible to humans and agents alike.
- The ecosystem of skills is rapidly growing, with foundational, third-party, and enterprise-specific skills emerging.
- Skills can be progressively disclosed to manage context windows and allow for composability.
- Examples: Anthropic's Claude Code agent, document skills, scientific research skills (Cadence), browser automation skills (Browserbase), Notion workspace skills.
- Methodology: Shifting from building agents to building skills, emphasizing modularity and composability.
- Speakers: Mahesh Murag and Barry Zhang (Anthropic).
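Progressive disclosure of skills can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: the `SKILL.md` file name, the "first line is the name, second line is the description" layout, and the names `SkillSummary`, `discover_skills`, and `load_skill` are all assumptions made for the example. The idea shown is that only a short summary of each skill folder enters the context up front, and the full instructions are read on demand.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillSummary:
    name: str
    description: str
    path: Path


def discover_skills(root: Path) -> list[SkillSummary]:
    """First disclosure step: read only a name and one-line description per skill.

    Assumes (hypothetically) that each skill is a folder containing SKILL.md,
    whose first line is a title and whose second line is a short description.
    """
    summaries = []
    for readme in sorted(root.glob("*/SKILL.md")):
        lines = readme.read_text(encoding="utf-8").splitlines()
        name = lines[0].lstrip("# ").strip() if lines else readme.parent.name
        description = lines[1].strip() if len(lines) > 1 else ""
        summaries.append(SkillSummary(name, description, readme.parent))
    return summaries


def load_skill(summary: SkillSummary) -> str:
    """Second disclosure step: pull the full instructions into context on demand."""
    return (summary.path / "SKILL.md").read_text(encoding="utf-8")
```

An agent loop would place the output of `discover_skills` in its prompt and call `load_skill` only for the skill it decides to use, which keeps the context window small as the number of installed skills grows.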
3. Mastering Context Engineering for Effective AI Coding Agents
- Main Argument: Effective use of AI coding agents, especially in complex, brownfield codebases, hinges on advanced context engineering techniques to manage the LLM's context window and guide its behavior.
- Key Points:
- LLMs are stateless, and performance is directly tied to the quality of tokens in the context window.
- Naive interaction with coding agents involves iterative correction, often leading to context exhaustion.
- "Intentional compaction" (compressing context into markdown files) and "sub-agents" (for controlling context in specific tasks) are key strategies.
- The "dumb zone" concept highlights diminishing returns as context window usage increases, emphasizing the need to stay in the "smart zone."
- The "Research, Plan, Implement" (RPI) workflow is presented as a structured approach to context engineering, prioritizing planning and clear specifications.
- Mental alignment within teams is crucial, and detailed plans can facilitate this better than raw code reviews.
- Examples: Dex Horthy's personal experience refactoring a 300,000-line Rust codebase and shipping 35,000 lines of code to BAML using the RPI workflow.
- Methodology: Intentional compaction, sub-agents, RPI workflow, staying in the "smart zone" by managing context effectively.
- Speaker: Dex Horthy (HumanLayer).
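The "smart zone" and intentional compaction ideas can be illustrated with a small budget tracker. This is a sketch under stated assumptions: the 40% threshold, the four-characters-per-token estimate, and the `ContextBudget` class are illustrative choices, not figures or code from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    window_tokens: int
    smart_zone_fraction: float = 0.4  # assumed threshold, chosen for illustration
    messages: list[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        # Crude estimate: roughly 4 characters per token.
        return sum(len(m) for m in self.messages) // 4

    def needs_compaction(self) -> bool:
        # True once usage leaves the assumed "smart zone".
        return self.used_tokens() > self.window_tokens * self.smart_zone_fraction

    def compact(self, summary_markdown: str) -> None:
        """Replace the running history with a short markdown summary."""
        self.messages = [summary_markdown]
```

In practice the markdown summary would be written by the agent itself (for example a progress file listing what was learned and what remains), and a fresh session would start from that file rather than from the full transcript.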
4. Building Cursor Composer: Infrastructure, Training, and Evaluation
- Main Argument: Cursor developed its own coding model, "Composer," by focusing on efficient RL training, matching training and inference environments, and leveraging custom infrastructure to achieve both speed and intelligence.
- Key Points:
- Composer is designed to be fast and smart, outperforming open-source models and competing with frontier models while being more token-efficient.
- The development involved training a large mixture of experts model, parallelizing across thousands of GPUs, and using custom kernels for low-precision training.
- Challenges included matching training and inference environments, handling complex rollouts, and ensuring consistency.
- Asynchronous RL with pipeline RL and inflight weight updates were key to efficient training.
- Semantic search, powered by a custom embedding model, significantly improved agent performance, especially for Composer.
- RL allowed for tuning model behavior, such as parallel tool calling and improved search strategies.
- Technical Terms: Mixture of Experts (MoE), Reinforcement Learning (RL), custom kernels, NVIDIA Blackwell chips, Ray, PyTorch, KV cache, semantic search, inflight weight updates.
- Speaker: Lee Robinson (Cursor).
5. Evaluating AI Coding Models: From Benchmarks to Real-World Tasks
- Main Argument: Evaluating AI coding models requires a multi-faceted approach, moving beyond static benchmarks to dynamic evaluations, robust grading, and considering real-world task complexity and potential for "reward hacking."
- Key Points:
- Data Contamination: A significant challenge where models are trained on internet data, including benchmark problems. Dynamic evaluation sets (e.g., problems released after model training) help combat this.
- Insufficient Test Suites: Brittle test cases can lead to incorrect solutions passing. Generating diverse tests (fuzzing) is crucial.
- Difficulty Distributions: Benchmarks need to provide a good signal for progress, avoiding problems that are too easy or too hard.
- Real-World Software Optimization: Evaluating models on tasks like optimizing C/C++/Rust codebases requires construct validity (measuring what's intended) and reliable grading.
- Reward Hacking: Models can exploit evaluation infrastructure or test distributions. Techniques like LLM judges (e.g., GPT-5) are proposed to detect non-idiomatic coding patterns and adversarial behaviors.
- Intermediate Grading Signals: For long-horizon tasks, measuring incremental progress (e.g., fraction of code translated) is important.
- Human-Centric Experiment Design: In "in the wild" evaluations (e.g., Copilot Arena, Repo Chat), understanding human behavior, especially latency tolerance, is critical.
- Examples: LiveCodeBench (competition programming), software optimization benchmarks (llama.cpp), Zopfli codebase translation, Copilot Arena, Repo Chat.
- Methodology: Dynamic evaluations, automated test case generation, LLM judges for reward hacking detection, intermediate grading signals, human-centric experiment design.
- Speaker: Naman Jain (Cursor).
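The dynamic-evaluation idea (score a model only on problems released after its training cutoff) can be sketched as below. The `Problem` record and both function names are hypothetical and are not taken from any benchmark's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


def pass_rate(results: dict[str, bool], problems: list[Problem]) -> float:
    """Fraction of the selected problems the model solved (missing results count as failures)."""
    if not problems:
        return 0.0
    return sum(results.get(p.problem_id, False) for p in problems) / len(problems)
```

Comparing `pass_rate` on the full set against the filtered set gives a rough contamination signal: a large drop on the post-cutoff problems suggests the earlier ones were seen during training.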
6. World Models for Computation: Modeling the World of Code
- Main Argument: Building "world models" for computation, specifically by modeling program execution traces, can lead to better AI reasoning, planning, and decision-making capabilities in code-related tasks.
- Key Points:
- Code World Model (CWM): A 32 billion parameter dense transformer trained on trillions of tokens, focusing on predicting program execution.
- Modeling Code: Moving beyond syntax to explicitly model execution, creating structured representations of programs and their states.
- Execution Tracing: Generating line-by-line traces of program execution, including local variables and memory states, which can be fed to models.
- Simulating Actions: World models allow for simulating actions (e.g., executing code) without actual execution, leading to more efficient agentic workflows.
- Bash-Oriented Model: CWM is designed to be bash-oriented, learning to use the terminal effectively for tasks.
- Synchronous RL: A highly synchronous RL setup with queues for models and trajectories is used to achieve high throughput during post-training.
- Neural Debugger: CWM's ability to trace execution can power a neural debugger, helping users compose code and express desired program structures.
- Halting Problem Approximation: CWM's simulation capabilities might offer ways to approximate solutions to fundamental computer science problems like the halting problem.
- Technical Terms: Code World Model (CWM), execution traces, transition function, program states, bash, synchronous RL, neural debugger, halting problem.
- Speaker: Jacob Kahn (Meta AI).
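To make "line-by-line traces including local variables" concrete, here is a minimal sketch using Python's standard `sys.settrace` hook. It only illustrates the kind of data an execution trace contains; it is not the tracing pipeline used to build CWM, and `trace_execution` and `running_total` are names invented for the example.

```python
import sys


def trace_execution(func, *args):
    """Run func and record (line offset within func, copy of locals) before each line runs."""
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            trace.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, trace


def running_total(n):
    total = 0
    for i in range(n):
        total += i
    return total
```

Calling `trace_execution(running_total, 3)` yields the return value plus a list of states, one per executed line. A world model in the sense described above would be trained to predict each next state from the code and the previous state instead of running the program.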
7. Efficient Reinforcement Learning for Enterprise AI
- Main Argument: Enterprises need efficient and scalable RL techniques to build specialized AI systems that deliver quantitative ROI, moving beyond the large-scale, multi-week training runs typical of research labs.
- Key Points:
- Applied Compute's Mission: Helping enterprises build their own intelligence for real work, moving AI beyond productivity to automation.
- RL for Specialization: RL is used to adapt models to specific enterprise use cases and out-of-distribution data.
- Inefficiency of Synchronous RL: Synchronous RL leads to idle GPUs due to straggler samples, resulting in low utilization and high costs.
- Asynchronous RL (Pipeline RL): Dedicating GPUs to sampling and training separately, allowing for inflight weight updates and improved efficiency.
- Staleness Trade-off: Increasing staleness (allowing older model weights in samples) improves efficiency but can lead to learning instability due to high variance in importance ratios.
- Systems Modeling: Optimizing RL efficiency requires modeling compute budget (GPUs), training batch size, sampling throughput (latency per GPU), and training throughput (tokens per second per GPU).
- Simulation: Modeling workloads and simulating different configurations (GPU allocation, batch sizes) before running expensive training jobs.
- Methodology: Shifting from synchronous to asynchronous RL, pipeline RL, systems modeling for optimization, simulation of workloads.
- Speakers: Rhythm Garg and Linden Li (Applied Compute).
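The straggler problem in synchronous RL can be illustrated with a toy simulation. The latency range, the 5x straggler multiplier, and both function names are assumptions picked for illustration, not numbers from the talk. The model is simply that every GPU waits for the slowest sample before the training step can start.

```python
import random


def synchronous_utilization(latencies: list[float]) -> float:
    """Share of GPU time spent sampling when every GPU waits for the slowest sample."""
    step_time = max(latencies)
    return sum(latencies) / (step_time * len(latencies))


def simulate(num_gpus: int, steps: int, straggler_prob: float, seed: int = 0) -> float:
    """Average utilization over several steps with occasional long rollouts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        latencies = [
            # Assumed latency model: 8-12 s per rollout, 5x longer for stragglers.
            rng.uniform(8.0, 12.0) * (5.0 if rng.random() < straggler_prob else 1.0)
            for _ in range(num_gpus)
        ]
        total += synchronous_utilization(latencies)
    return total / steps
```

For example, three samples of 10 seconds and one of 50 seconds give a utilization of 0.4, since 80 GPU-seconds of useful sampling are spread over a 200 GPU-second step. A simulation of this kind, extended with training throughput and a staleness limit, is the sort of model one might run before committing GPUs to a configuration.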
8. Scaling RL Environments: The Prime Intellect Approach
- Main Argument: Increasing the accessibility of AI research is key to scaling innovation. Prime Intellect focuses on building an "open super intelligence stack" with a strong emphasis on accessible RL environments and tooling.
- Key Points:
- Talent Bottleneck: The difficulty in finding and retaining AI researchers necessitates increasing the pool of AI talent by making research more accessible.
- Open Source Ecosystem: Parallels drawn between AI research and open-source software ecosystems (Linux, Node) to emphasize compounding abstractions and best practices.
- Environments as Entry Points: RL environments are seen as the "web apps of AI research" – simple, self-contained, pedagogical, and requiring experimentation.
- Environments Hub: An open-source platform for creating, discovering, and sharing RL environments and evaluations.
- Verifiers Toolkit: A library for building environments, supporting various tasks like tool use, sandboxing, and agent frameworks.
- Hierarchical Design: Environments are designed hierarchically, from foundational pieces to complex applications, prioritizing extensibility.
- Prime RL Trainer: A large-scale training stack incorporating best practices for asynchronous RL.
- Community Focus: Emphasizing community contribution and feedback loops to improve tooling and research accessibility.
- INTELLECT-3: A large model trained on 500 GPUs, validating the efficiency and performance of their stack.
- Examples: Wiki Search environment, Verifiers toolkit, Prime RL Trainer, Prime Environments repo.
- Methodology: Building an open-source ecosystem, creating accessible RL environments, hierarchical design, community collaboration.
- Speaker: Will Brown (Prime Intellect).
9. OpenAI's Agent Reinforcement Fine-Tuning (RFT) for Code Models
- Main Argument: Agent RFT is a powerful technique for enhancing AI agent performance by fine-tuning model weights based on custom reward signals, enabling agents to interact with the outside world during training.
- Key Points:
- Agent RFT Benefits: Improves reasoning models, is sample-efficient (success with as few as 10 examples), lowers latency, and improves task performance.
- Domain Shift Mitigation: RFT helps adapt models to specific business contexts and tool usage through weight adjustments.
- Tool Interaction During Training: This is the first time OpenAI has allowed models to interact with the outside world (via endpoints) during training.
- Systems-Level Integration: Unique identifiers (UUIDs) associate tool calls with specific rollouts for holistic grading.
- Recommended Process: Grounding in a baseline, optimizing with prompt/task engineering, and then applying RFT for further gains.
- Customer Spotlights:
- Cognition: Improved code edit planning (F1 score reward), learned parallel tool calls, highlighting the importance of data quality and volume.
- Qodo: Enhanced deep research agent (recall reward), reduced tool calls, stabilized behavior by eliminating long-tail cases.
- Cosine: Trained agents for enterprise codebases with strict grading, achieving state-of-the-art performance and faster agents.
- Mako: Trained agents for GPU kernels (PyTorch prompts, custom reward function), overcoming reward hacking and achieving significant speedups.
- Key Principles for Success: Well-defined tasks, matching train/eval data to production, ensuring models can learn from their own exploration (variance), and non-hackable, continuous reward functions.
- Methodology: Agent RFT, custom reward signals, UUIDs for tracking, phased adoption (baseline, prompt/task optimization, RFT).
- Speakers: Will Hang and Cathy Zhou (OpenAI).
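Two of the ideas above, a continuous F1-style reward and per-rollout identifiers for tool calls, can be sketched in a few lines. This is an illustration under assumptions and not OpenAI's API: `f1_reward` and `RolloutLog` are hypothetical names, and comparing sets of file paths is just one plausible way to score an edit plan.

```python
import uuid
from collections import defaultdict


def f1_reward(predicted: set[str], expected: set[str]) -> float:
    """Continuous reward in [0, 1]: harmonic mean of precision and recall."""
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


class RolloutLog:
    """Groups tool calls by rollout ID so a grader can score a whole trajectory."""

    def __init__(self) -> None:
        self._calls: dict[str, list[str]] = defaultdict(list)

    def start_rollout(self) -> str:
        return str(uuid.uuid4())

    def record(self, rollout_id: str, tool_call: str) -> None:
        self._calls[rollout_id].append(tool_call)

    def calls_for(self, rollout_id: str) -> list[str]:
        return list(self._calls[rollout_id])
```

A graded reward such as F1 gives partial credit for partially correct plans, which is one way to satisfy the "continuous reward function" principle listed above; a pass/fail reward would return the same 0 for a near miss and a complete miss.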
10. The Future of Front-End Engineering: Vibe Engineering and Agent Collaboration
- Main Argument: The rise of AI agents is transforming front-end development, shifting the focus from meticulous code optimization to "vibe engineering" – a more intuitive, collaborative process of guiding AI agents and curating their output.
- Key Points:
- Evolution of Front-End: Recap of front-end development trends since 2017, highlighting increasing complexity and the struggle with basic styling and browser compatibility.
- LLMs and React: LLMs are adept at writing React code, but the "abstraction" layer can sometimes lead to complexity. LLMs don't care about repetitive code, which can be a benefit.
- Vibe Coding vs. Vibe Engineering: Distinguishing between passively accepting AI output ("vibe coding") and actively guiding and curating it ("vibe engineering").
- The Role of the Developer: Developers are shifting from writing code to curating the AI's output, managing context, and ensuring quality.
- The "Pain in the Ass" Developer: Identifying developers who resist AI adoption due to a focus on micro-optimizations and a lack of acceptance of new paradigms.
- Skill Issue: Effective AI collaboration requires a blend of technical knowledge, prompt engineering, understanding model limitations, and being "chronically on Twitter" for the latest trends.
- Composer 1's Impact: The model's speed and responsiveness have made "vibe engineering" more interactive and less like waiting for a slow process.
- Abstraction vs. Simplicity: A caution against over-abstraction just because it's possible, emphasizing the need for simplicity and understanding.
- Job Market Impact: AI agents may displace junior roles, but senior engineers who can effectively collaborate with AI and maintain legacy systems will remain valuable.
- Examples: Kitze's personal projects (Sizzy, Life OS, Glink), refactoring projects using Composer 1, the concept of "vibe engineering" prompts.
- Methodology: Vibe engineering, leveraging AI for rapid prototyping, focusing on intuition and collaboration, understanding the limitations and capabilities of AI models.
- Speaker: Kitze.
11. Google's AI Studio and Gemini 3: Democratizing Software Creation
- Main Argument: Google's AI Studio, powered by Gemini 3 and Nano Banana Pro, aims to democratize software creation by making it easier for anyone to build AI-powered applications through intuitive "vibe coding" and seamless integration of advanced models.
- Key Points:
- Gemini 3 Pro: Google's latest state-of-the-art model, excelling in UI/aesthetic sensibilities and agentic tool calling.
- Nano Banana Pro: An advanced image model powered by Google Search, capable of generating detailed infographics, improving text rendering, and offering creative controls.
- AI Studio: A platform for building AI-powered apps with a gallery of examples, free usage, and easy integration of Gemini APIs.
- AI Chips: Features that enhance Gemini API functionality, including Google Search grounding and live API integrations.
- Full-Stack Runtime: Upcoming support for backend development, enabling the creation of full-stack applications with a single prompt.
- Agentic IDE (Antigravity): A new IDE that integrates with AI Studio, allowing for seamless migration and extension of AI-powered applications.
- Democratizing Creation: The goal is to empower anyone to build software, rethinking traditional paradigms and making development more intuitive.
- Examples: Laptop sticker generation, comic book story creation, slick animation website generation, AI Studio UI cloning for Antigravity, multiplayer racing game development.
- Methodology: Vibe coding, leveraging AI for UI/UX design and full-stack development, integrating multimodal capabilities.
- Speakers: Kat Kampf and Ammaar Reshi (Google DeepMind).
12. Factory's Approach: Agent-Ready Codebases and Validation Criteria
- Main Argument: The effectiveness of AI agents in software engineering is heavily dependent on the underlying codebase's validation criteria. Organizations need to invest in rigorous automated validation to unlock the full potential of AI agents.
- Key Points:
- Autonomy via Verification: Shifting from specification-driven development to automation via verification, where AI searches for solutions based on defined objectives.
- Asymmetry of Verification: Many tasks are easier to verify than to solve, making them ideal for AI.
- Codebase Validation: The importance of automated validation for code format (linters), correctness (tests), and adherence to architectural patterns.
- AI Agents and Validation: AI agents break when validation is lacking. Rigorous validation enables more complex AI workflows like parallel agent execution and large-scale refactoring.
- Developer Role Shift: Developers are becoming curators of the AI development environment, setting constraints and building automations.
- Investing in Environment: Organizations should invest in feedback loops and validation criteria to enhance AI agent capabilities, leading to significant velocity gains (5x-7x).
- The "Slop Test": Even a basic test that catches some errors is better than no test, as it provides a foundation for improvement.
- Methodology: Analyzing codebase validation across eight pillars, improving opinionated linters and tests, using AI to identify and fix validation gaps.
- Speaker: Eno Reyes (Factory).
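A validation gate of the kind described above can be as simple as running each check as a command and collecting pass/fail results that an agent (or a human) can act on. The sketch below is an assumption-laden illustration rather than Factory's tooling: `run_checks` is a hypothetical helper, and the commands passed to it are placeholders for a project's real linter and test invocations.

```python
import subprocess


def run_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each validation command and report whether it exited with status 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results
```

An agent harness could call this after every edit and feed the names of failing checks back into the next prompt, so the automated validation serves as the objective the agent searches against.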
13. Sourcegraph's Amp: An Opinionated Frontier Agent
- Main Argument: Sourcegraph's Amp agent is built with an opinionated approach, focusing on a custom toolset, sub-agents for context management, and distinct agent modalities (smart vs. rush) to provide a unique and effective AI coding experience.
- Key Points:
- Opinionated Frontier Agent: Amp embraces the "weird and magical" aspects of AI in coding, aiming to live a year in the future.
- Custom Toolset: Prioritizing a refined, custom toolset over generic MCP integrations to enable better feedback loops and reduce context confusion.
- Sub-Agents: Utilizing specialized sub-agents (Finder, Oracle, Librarian, Kraken) for code search, reasoning, external context retrieval, and large-scale refactoring to conserve and extend context windows.
- Agent Modalities: Offering two top-level agents: "Smart Agent" (for complex tasks with sub-agents) and "Rush Agent" (for in-the-loop, quick edits).
- UI/UX: Providing both terminal and editor interfaces, with a focus on efficient diff review and guiding users through agent output.
- Economic Accessibility: Implementing a subtle ad network to sponsor inference costs for the "Rush Agent," making it more accessible.
- Community Building: Fostering a community of builders focused on experimenting with AI agents and pushing the frontier of what's possible.
- Technical Terms: Sub-agents, context window management, Oracle (reasoning sub-agent), Kraken (code mod sub-agent), model selector vs. agent-oriented architecture, TUI framework, diff viewer, community of builders.
- Speaker: Beyang Liu (Sourcegraph).
14. Gimlet Labs: AI-Generated Kernels for PyTorch Optimization
- Main Argument: AI can be used to automatically generate optimized low-level kernels for PyTorch workloads, speeding up agentic inference clouds and addressing the shortage of kernel optimization experts.
- Key Points:
- Agentic Inference Cloud: Gimlet Labs builds platforms that orchestrate heterogeneous compute for agentic workloads.
- Kernel Optimization: Low-level kernel optimization can significantly improve ML workload performance (e.g., 3x throughput for Llama models).
- Expert Shortage: A lack of kernel optimization experts makes this a bottleneck for many AI applications.
- AI for Kernel Synthesis: Using AI to automatically port and optimize kernels for different hardware platforms (CUDA, Triton, Metal).
- Challenges: Defining correctness for floating-point operations, reliable performance measurement (avoiding launch time vs. execution time), handling hardware-specific characteristics, and preventing reward hacking.
- Preliminary Results: Achieved average speedups of 25% on M4 using Metal with the KernelBench dataset, with sweet spots in moderately complex problems.
- Case Studies: Kernel fusion (40% speedup), rewriting PyTorch code to use more optimized ops (80% speedup), matrix multiplication failure, and a case of accidental optimization (71,000x speedup due to pruning unnecessary work).
- Human in the Loop: Emphasizing the need for human supervision to guide optimization, validate results, and handle complex or novel scenarios.
- Agentic Architecture: Supervisor agent, synthesis agent swarm, verification agent (hardware-in-the-loop), and human prompting for optimization.
- Methodology: AI-driven kernel synthesis, hardware-in-the-loop verification, focusing on fusion, op rewriting, and addressing hardware specifics.
- Speaker: Natalie Serrino (Gimlet Labs).
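Two of the challenges listed above, floating-point correctness and reliable timing, can be sketched with the standard library alone. The tolerances, warm-up count, and function names are illustrative assumptions. On a real GPU the timed region would also need a device synchronization before the clock is stopped, because kernel launches return before the kernel finishes; that step is omitted in this CPU-only sketch.

```python
import math
import statistics
import time


def outputs_match(candidate, reference, rel_tol=1e-4, abs_tol=1e-6) -> bool:
    """Element-wise comparison with tolerances instead of exact float equality."""
    return len(candidate) == len(reference) and all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def median_runtime(fn, *args, warmup=3, repeats=10) -> float:
    """Median wall-clock seconds per call, after warm-up runs that are discarded."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

A verification agent would accept a generated kernel only if `outputs_match` holds against the reference implementation on several inputs, and only then compare `median_runtime` values. Checking correctness first is one guard against the "accidental optimization" case, where a kernel is fast because it skips work.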
15. Netflix's Lessons: Direction Over Speed in AI-Assisted Development
- Main Argument: The ease of AI code generation can lead to a "software crisis of complexity" if developers don't prioritize understanding and simplicity over speed. The key is to invest in thinking and planning, not just generation.
- Key Points:
- Confusing Easy with Simple: AI makes the path of "easy" (quick generation) frictionless, but it doesn't inherently create "simple" (understandable, maintainable) code.
- Complexity Compounding: Choosing easy over simple leads to accumulated complexity that AI, without human guidance, can exacerbate by preserving all patterns, including technical debt.
- Essential vs. Accidental Complexity: AI struggles to distinguish between the fundamental problem complexity and the added complexity from workarounds and past decisions.
- The Three-Phase Approach:
- Research: Gathering context (docs, diagrams, Slack threads), analyzing the codebase, and correcting AI analysis.
- Plan: Creating a detailed, step-by-step implementation plan with clear specifications, function signatures, and architectural decisions.
- Implementation: Executing the plan with AI, focusing on verification against the plan rather than understanding emergent complexity.
- Human Checkpoint: The research phase requires human validation to ensure accuracy and prevent disasters.
- Knowledge Gap: AI generation speed outpaces human understanding, leading to a loss of the ability to recognize problems and a decline in critical thinking skills.
- The Future of Developers: Developers will focus on curating the AI environment, setting constraints, and ensuring the quality and maintainability of the software.
- Methodology: The three-phase approach (Research, Plan, Implement), prioritizing thinking and planning, manual migration to gain context, and using AI to accelerate mechanical tasks.
- Speaker: Jake Nations (Netflix).
16. Reconciling Benchmark Performance with Real-World Developer Productivity
- Main Argument: While benchmarks show impressive AI capabilities, real-world field experiments reveal that AI tools can sometimes slow down highly experienced developers working on complex, messy, and long-context tasks due to over-reliance, reliability issues, and the difficulty of eliciting optimal AI behavior.
- Key Points:
- Benchmark vs. Economic Evidence: A clash exists between benchmark scores (showing rapid AI progress) and real-world productivity studies (showing potential slowdowns).
- Human Baseline Data: Measuring AI performance against expert humans on diverse tasks (software, ML, cybersecurity) to establish a time horizon for AI capabilities.
- Time Horizon: A metric representing the length of task, measured by how long it takes a human expert to complete, at which AI achieves 50% success; it has been growing exponentially.
- Limitations of Benchmarks: Low context for human baselines, potential for saturation, and lack of messiness/real-world complexity.
- Field Experiment Findings: Experienced developers on large, mature open-source projects were slowed down by 19% when AI was allowed, despite having access to tools like Cursor and frontier models.
- Potential Explanations for Slowdown: Over-optimism about AI usefulness, high developer familiarity with tasks (reducing AI's benefit), low AI reliability, suboptimal capability elicitation, and interdependence across tasks.
- The Need for High Reliability: AI tools need to be highly reliable (95-99% success) to save developers time; otherwise, the verification and correction overhead negates the benefits.
- METR's Role: Researching AI capabilities and connecting them to potential catastrophic risks, emphasizing the need for more robust evidence beyond benchmarks.
- Methodology: Gathering human baseline data, measuring AI performance under identical conditions, converting time to AI time horizon, conducting field experiments (RCTs) with real developers and tasks.
- Speaker: Joel Becker (METR).
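Reading the time horizon as the task length, in human working time, at which a model succeeds half the time, one simple way to formalize it is a logistic curve in log task length. The functional form and the symbols below (t for human completion time, h for the horizon, β for the slope) are assumptions made for illustration; the summary itself only states the 50% threshold.

```latex
p(t) = \sigma\bigl(\beta\,(\log_2 h - \log_2 t)\bigr),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
p(h) = \sigma(0) = \tfrac{1}{2}
```

Under this sketch, tasks much shorter than h are solved almost always, tasks much longer than h almost never, and "exponential growth" means that the fitted h doubles at a roughly constant interval across successive models.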
17. Poolside: Closing the Gap Between Models and Human Intelligence
- Main Argument: Poolside is building its own models from scratch, focusing on reinforcement learning paired with LLMs, to close the gap between AI capabilities and human intelligence in knowledge work, particularly in high-consequence environments.
- Key Points:
- Mission: To close the gap between models and human intelligence by building proprietary models from scratch.
- Reinforcement Learning + LLMs: A core belief that RL is essential for LLMs to make a significant leap in capability.
- Malibu Agent: Poolside's second-generation model, trained using proprietary techniques.
- High-Consequence Environments: Focus on government and defense sectors, requiring highly reliable and secure AI agents.
- Ada to Rust Conversion: Demo showcasing the agent's ability to translate codebases and handle testing and verification.
- Agent Control: Need for ratcheting down AI access and permissions in sensitive environments.
- Future Models: Plans for a third generation of models, leveraging massive compute (40,000+ GB200s) for further advancements.
- API Availability: Poolside models will be available via their own API and on Amazon Bedrock.
- Partnership Opportunities: Seeking collaboration with companies building AI assistants and other AI applications.
- Examples: Ada to Rust code conversion, adding interactive features (up arrow command history) to the converted code, poem generation (for personal use).
- Methodology: Building proprietary models, focusing on RL, targeting high-consequence environments, emphasizing reliability and security.
- Speakers: Jason Warner and Eiso Kant (Poolside).
18. Arize: Prompt Learning for Agent Optimization
- Main Argument: Prompt learning, which uses English feedback from evaluations to iterate on system prompts, offers a more efficient and accessible way to improve AI agent performance than traditional RL, especially for teams building coding agents.
- Key Points:
- System Prompt Importance: System prompts are critical for agent success and are continuously iterated upon.
- RL vs. Prompt Learning: RL is powerful but can be sample-inefficient, data-hungry, and time-intensive. Prompt learning offers a more direct path using English feedback.
- Prompt Learning Process: Agent attempts task -> runs unit tests -> LLM judge provides feedback and explanation -> feedback is used to update the system prompt.
- Eval Engineering: The quality of LLM-generated evaluations and explanations is crucial for effective prompt learning.
- Comparison to DSPy's GEPA: Arize's approach is similar in using English feedback but claims to be more efficient, requiring fewer loops and rollouts due to better eval engineering.
- Results: Demonstrated improvements in SWE-bench scores for Claude and Cline using prompt learning.
- Methodology: Prompt learning, LLM-as-a-judge for evaluations, iterating on system prompts based on feedback.
- Speaker: Aparna Dhinakaran (Arize).
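The prompt learning loop described above (attempt, test, judge, update the system prompt) can be sketched with the agent, test runner, and judge passed in as plain callables. Everything here is a hypothetical illustration rather than Arize's implementation; in particular, appending the feedback as a list of rules is a simplification, where a real system might ask an LLM to rewrite the prompt instead.

```python
from typing import Callable


def prompt_learning(
    system_prompt: str,
    tasks: list[str],
    run_agent: Callable[[str, str], str],
    run_tests: Callable[[str, str], bool],
    judge: Callable[[str, str], str],
    rounds: int = 3,
) -> str:
    """Append judge feedback on failed tasks to the system prompt, round by round."""
    for _ in range(rounds):
        feedback = []
        for task in tasks:
            output = run_agent(system_prompt, task)
            if not run_tests(task, output):
                # The judge explains in plain English why the attempt failed.
                feedback.append(judge(task, output))
        if not feedback:
            break  # every task passed; stop early
        # De-duplicate feedback while preserving order.
        rules = "\n".join(f"- {note}" for note in dict.fromkeys(feedback))
        system_prompt = f"{system_prompt}\n\nLearned rules:\n{rules}"
    return system_prompt
```

Unlike RL, nothing in this loop touches model weights: the only artifact that changes is the prompt text, which is why the quality of the judge's explanations carries so much of the load.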
19. Cline's Approach: Benchmarks as RL Environments and the Truth Nuke
- Main Argument: The real bottleneck for AI model improvement is not agent scaffolding but the quality and availability of benchmarks that drive RL training. Cline is releasing "cline-bench" to provide real-world coding data as standardized RL environments.
- Key Points:
- Capability Beats Scaffolding: Frontier models like Gemini 3.0 outperform complex agent harnesses, suggesting that model capability is paramount.
- Benchmarks Drive Model Improvement: AI models improve when trained on hard problems within well-designed RL environments, not through clever engineering tricks.
- RL Environments: Benchmarks are essentially environments for RL training, requiring a starting state, prompt, and a verifier.
- RL Environments Factory: Automating the process of converting real-world coding data into containerized RL environments.
- The "Truth Nuke" (cline-bench): An open-source benchmark initiative to provide real software development data for training and evaluating AI coding agents, moving beyond trivial code puzzles.
- Community Contribution: Encouraging developers to contribute their data by using the Cline provider and opting into the initiative.
- Open Source and Accessible: cline-bench is free, open-source, and inspectable, aiming to accelerate frontier research for the entire ecosystem.
- Methodology: Building RL environments from real-world data, automating environment creation, focusing on robust verification, and fostering an open-source community for benchmarks.
- Speaker: Nik Pash (Cline).
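The three parts of an RL environment named above (a starting state, a prompt, and a verifier) can be illustrated with a small sketch. Everything here is a hypothetical stand-in rather than the benchmark's actual format: `CodingEnv`, `verify_add`, and the two policies are invented for illustration, and an in-memory dict of files stands in for a containerized repository snapshot.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Files = Dict[str, str]  # path -> file contents; stands in for a repo snapshot


@dataclass(frozen=True)
class CodingEnv:
    starting_state: Files              # the code as it was before the fix
    prompt: str                        # the task given to the agent
    verifier: Callable[[Files], bool]  # decides whether the final state is correct

    def score(self, policy: Callable[[str, Files], Files]) -> float:
        # Hand the policy a copy so the starting state stays reusable.
        final_state = policy(self.prompt, dict(self.starting_state))
        return 1.0 if self.verifier(final_state) else 0.0


def verify_add(files: Files) -> bool:
    """Toy verifier: run calc.py and check add(2, 3) == 5.

    exec() is unsafe on untrusted code; a real environment would run the
    check inside a container instead.
    """
    namespace: dict = {}
    try:
        exec(files.get("calc.py", ""), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


env = CodingEnv(
    starting_state={"calc.py": "def add(a, b):\n    return a - b\n"},
    prompt="Fix add() so it returns the sum.",
    verifier=verify_add,
)


def noop_policy(prompt: str, files: Files) -> Files:
    return files  # changes nothing, so the verifier should fail


def fixing_policy(prompt: str, files: Files) -> Files:
    files["calc.py"] = files["calc.py"].replace("a - b", "a + b")
    return files
```

Scoring `noop_policy` and `fixing_policy` against `env` gives a binary reward signal, which is the role the verifier plays during RL training. The design choice worth noting is that the verifier checks behaviour, not the text of the diff.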
20. METR's Findings: The Reality of AI in Real-World Software Engineering
- Main Argument: While benchmarks show impressive AI capabilities, real-world field experiments with experienced developers suggest that AI tools can sometimes slow down productivity due to over-reliance, reliability issues, and the complexity of eliciting optimal behavior in messy, high-context environments.
- Key Points:
- Benchmark vs. Real-World Discrepancy: Benchmarks indicate rapid AI progress, but field studies show potential slowdowns for expert developers.
- Field Experiment Design: Studying experienced developers on large, mature open-source projects, comparing AI-allowed vs. AI-disallowed conditions.
- Key Finding: Developers were slowed down by 19% when AI was allowed, contrary to expectations.
- Potential Explanations: Over-optimism about AI, high developer familiarity reducing AI's benefit, low AI reliability, suboptimal capability elicitation, and task interdependence.
- The Need for High Reliability: AI tools must be highly reliable to be truly beneficial; otherwise, the overhead of verification and correction negates the gains.
- Caveats: The study focused on expert developers and complex repositories; results may differ for other populations and tasks.
- Reconciling the Puzzle: Ongoing research to understand the discrepancy, including potential improvements in AI reliability, elicitation techniques, and the impact of task messiness.
- Methodology: Field experiments (RCTs), gathering human baseline data, analyzing developer behavior through screen recordings, and exploring potential explanations for observed productivity impacts.
- Speaker: Joel Becker (METR).
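To make the 19% figure concrete, here is one simple way a slowdown percentage can be computed from task completion times in the two conditions. This is only an arithmetic illustration: it is not METR's estimator, `slowdown_pct` is a hypothetical helper, and the timings below are invented to show the calculation.

```python
from typing import Sequence


def slowdown_pct(ai_allowed: Sequence[float], ai_disallowed: Sequence[float]) -> float:
    """Percent change in mean completion time when AI is allowed.

    Positive values mean tasks took longer with AI; negative values mean a speedup.
    """
    mean_allowed = sum(ai_allowed) / len(ai_allowed)
    mean_disallowed = sum(ai_disallowed) / len(ai_disallowed)
    return (mean_allowed / mean_disallowed - 1.0) * 100.0


# Invented task times in minutes, chosen only to illustrate the arithmetic.
allowed_minutes = [119.0, 119.0]
disallowed_minutes = [100.0, 100.0]
result = round(slowdown_pct(allowed_minutes, disallowed_minutes), 6)
```

With these made-up numbers, tasks average 119 minutes with AI versus 100 without, which works out to a 19% slowdown. A real analysis of a randomized experiment would also need confidence intervals and some handling of differences in task difficulty.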
21. Google Antigravity: The First Agent-First IDE
- Main Argument: Google DeepMind's Antigravity IDE is a new agent-first development platform designed to leverage the latest AI model capabilities (Gemini 3, Nano Banana Pro) and introduce a new interaction paradigm with "artifacts" for managing agent workflows.
- Key Points:
- Agent-First Paradigm: Agents are central, operating across multiple surfaces (editor, browser, agent manager).
- Three Surfaces:
- Agent Manager: Central hub for managing multiple agents and artifacts.
- AI Editor: Familiar IDE experience with agent sidebar for focused tasks.
- Agent-Controlled Browser: Allows agents to interact with the web, retrieve context, and perform actions like testing apps.
- Artifacts: Dynamic representations of information generated by agents (plans, walkthroughs, images, screen recordings) used for organization, communication, and memory.
- Research-Product Flywheel: Antigravity is built by Google engineers for Google engineers, creating a feedback loop for model and product improvement.
- Leveraging Model Capabilities: Designed to take advantage of Gemini 3's intelligence, reasoning, tool use, longer task execution, and multimodal capabilities.
- Collaboration: Encouraging feedback and collaboration with the research teams to improve both the models and the product.
- Future Vision: Aiming to make software development more intuitive and accessible, allowing anyone to build software by abstracting away complexity.
- Technical Terms: Agent Manager, Artifacts, Agent-Controlled Browser, Computer Use, Image Generation, Multimodal, Research-Product Flywheel.
- Speaker: Kevin Hou (Google DeepMind).
22. AI Engineer Conference Announcements and Future Plans
- Main Argument: The AI Engineer conference series is growing, with new events planned in San Francisco (World's Fair 2026) and London (Baby World's Fair), and a new General Manager to oversee expansion.
- Key Points:
- Growth and Demand: Significant growth in applicant numbers and sold-out events indicate strong demand for the AI Engineer community.
- Summit vs. World's Fair: Summit events are smaller, single-track, and focused on specific themes, while World's Fair events are larger, multi-track, and aim to capture the breadth of AI.
- World's Fair San Francisco 2026: A four-day event at Moscone West, expanding capacity to accommodate growing demand.
- World's Fair Europe (London 2026): The first European iteration of the World's Fair, held at the Queen Elizabeth II Centre in Westminster.
- Partner Program: Launching a program to collaborate with local partners in different cities for future events (Paris, Miami, Melbourne).
- New General Manager: Leah McBride, with extensive experience in tech event marketing (Twitter, Google), joins to manage the company's growth and operations.
- Community Focus: Emphasis on building a strong community and providing valuable content and networking opportunities.
- Announcements: World's Fair San Francisco 2026, World's Fair Europe (London 2026), AI Engineer Miami, AI Engineer Melbourne, Leah McBride as General Manager, Partner Program.
- Speakers: Benjamin Dunphy and swyx (AI Engineer Co-founders), Leah McBride (General Manager).
Synthesis/Conclusion:
The AI Engineering Code Summit 2025 underscored a pivotal moment in software development, marked by the rapid advancement of AI models and the emergence of new paradigms for human-AI collaboration. The overarching theme was the transition from traditional coding to a more intuitive, agent-assisted workflow, encapsulated by the "New Code" and the "No More Slop" imperative. Key takeaways highlight the industry's move towards modular agent architectures ("skills"), the critical role of context engineering, and the development of specialized models like Cursor Composer.
The conference also emphasized the need for robust evaluation methodologies, moving beyond static benchmarks to dynamic, real-world assessments that account for complexity, reliability, and potential adversarial behaviors. Speakers from Poolside, Cline, and METR showcased different approaches to model development and evaluation, and stressed the role of RL and benchmarks in pushing AI capabilities.
Furthermore, the event underscored the transformative impact of AI on the developer experience, with platforms like Amp and Antigravity offering new interfaces and workflows. The discussions around "vibe coding" and "vibe engineering" reflected a shift towards more collaborative and intuitive ways of working with AI, while also cautioning against prioritizing ease over simplicity and against letting complexity spiral out of control without proper planning and validation.
Finally, the conference concluded with significant announcements regarding the expansion of the AI Engineer event series, including World's Fair San Francisco 2026 and World's Fair Europe in London, signaling continued growth and commitment to fostering this rapidly evolving community. The overarching message is clear: AI is not just a tool but a fundamental shift in how software is conceived, built, and maintained, demanding a re-evaluation of our processes, skills, and the very definition of software engineering.