AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!
By AI Engineer
Key Concepts
- Agentic Coding: The shift from human-written code to AI-driven development where agents act as autonomous or semi-autonomous programmers.
- Latency Debt: The accumulation of delays in AI inference due to larger models and increased token usage, which degrades developer experience.
- Context Engineering: The systematic design of providing models with relevant information, tools, and instructions in the correct format at the right time.
- Knowledge Graphs (KGs): A method of structuring data to capture relational connections (structural similarity) rather than just semantic meaning (text similarity).
- Inference Optimization: Techniques like speculative decoding, disaggregated inference, and specialized sub-agents to reduce compute costs and latency.
- Agentic Memory: The ability of agents to persist, curate, and update context across sessions, moving beyond stateless LLM interactions.
1. Transforming Programming Mindsets (David House, G2I)
- The Adoption Model: Successful adoption of coding agents requires moving from "disempowerment" (reacting to AI) to "agency" (steering the AI).
- The Framework: G2I utilizes a staged handoff framework:
  - Slash-brief: Agent acts as a Product Manager to create a brief.
  - Slash-spec: Agent acts as a Technical Architect to create a spec.
  - Slash-code/test: Agent performs TDD (Test Driven Development).
  - Slash-review: Agent verifies implementation against the spec.
- Key Insight: For beginners, frameworks should constrain input to prevent errors; for experts, frameworks should amplify input.
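The staged handoff can be sketched as a small pipeline in which each stage's artifact feeds the next. This is a minimal illustration, not G2I's implementation: the `Stage` type, the `run_handoff` function, the exact command spellings, and the `echo_agent` stand-in for a model call are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: (role, task input) -> produced artifact.
Agent = Callable[[str, str], str]


@dataclass(frozen=True)
class Stage:
    command: str  # slash command for the stage (spelling assumed for this sketch)
    role: str     # persona the agent adopts in this stage


STAGES = [
    Stage("/brief", "Product Manager"),
    Stage("/spec", "Technical Architect"),
    Stage("/code", "TDD Engineer"),
    Stage("/review", "Reviewer"),
]


def run_handoff(request: str, agent: Agent) -> dict[str, str]:
    """Run the stages in order, handing each artifact to the next stage."""
    artifacts: dict[str, str] = {}
    current = request
    for stage in STAGES:
        if stage.command == "/review":
            # The review stage checks the implementation against the spec,
            # so it receives both artifacts rather than only the latest one.
            current = f"spec: {artifacts['/spec']}\nimplementation: {current}"
        current = agent(stage.role, current)
        artifacts[stage.command] = current
    return artifacts


def echo_agent(role: str, task: str) -> str:
    """Toy agent that only records which role handled which input."""
    return f"[{role}] {task}"


if __name__ == "__main__":
    for command, artifact in run_handoff("add CSV export", echo_agent).items():
        print(command, "->", artifact)
```

Keeping the stages as data rather than hard-coded calls is one way to support the beginner/expert distinction above: a constrained setup can fix the stage list, while an expert setup can edit it.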
2. Solving Latency Debt (Sarah Chiang, Cerebras)
- The Problem: While models have grown in intelligence and context window size, inference speed has stagnated (50–150 tokens/sec), creating "latency debt."
- The Solution: Codex Spark, from Cerebras and OpenAI, reaches 1,200 tokens/sec by optimizing the inference stack:
  - Hardware: Moving memory on-chip (distributed SRAM) to solve the "memory wall."
  - Disaggregated Inference: Separating "prefill" (compute-bound) and "decode" (memory-bound) tasks onto specialized hardware.
  - Model Architecture: Using Mixture of Experts (MoE) and expert activation pruning to maintain intelligence while reducing compute costs.
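To make the throughput gap concrete, the arithmetic below converts the quoted decode rates into wall-clock time for a single agent turn. The 6,000-token output size is an assumption chosen for illustration, not a figure from the talk, and the calculation ignores prefill time and network overhead.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady decode rate (prefill ignored)."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    output_tokens = 6_000  # assumed size of one agent turn; not from the talk
    for label, rate in [("50 tok/s", 50), ("150 tok/s", 150), ("1,200 tok/s", 1200)]:
        seconds = generation_seconds(output_tokens, rate)
        print(f"{label:>12}: {seconds:6.1f} s")
```

Under that assumption, the same turn takes 120 s at 50 tokens/sec, 40 s at 150 tokens/sec, and 5 s at 1,200 tokens/sec. Because agent loops chain many such turns, the delay compounds across a task.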
3. Ambient Generative AI (Le Kalinowski, Kalstack)
- Methodology: Deploying latent diffusion models directly on mobile Neural Processing Units (NPUs).
- Innovation: Bypassing the text-to-embedding pipeline by using raw ambient sensor data (e.g., light sensors, accelerometers) to drive latent updates.
- Performance: Achieved ~600 ms latency with zero cloud API calls, showing that complex generative tasks can run locally on mobile hardware.
4. Sub-Agents and Specialized Models (Tis, Morph)
- Software 3.5: The paradigm of "agents prompting other agents."
- Specialization: General frontier models are expensive. Specialized models (e.g., for code search or context compaction) can perform specific tasks faster and cheaper.
- Inference Optimization: Uses speculative decoding (heuristic-based guesses verified by a larger model) and kernel optimization to maintain speed.
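The speculative decoding idea mentioned above can be sketched with toy "models" that are plain functions over token lists. This is a simplified greedy variant, not Morph's implementation: `target`, `draft`, and `TEXT` are hypothetical stand-ins, and a real system verifies all drafted positions in one batched forward pass of the larger model rather than one call per position.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[str]], str]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[str], max_new: int, k: int = 4) -> List[str]:
    """Greedy speculative decoding: the draft guesses, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft proposes up to k tokens.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        for i, guess in enumerate(proposal):
            expected = target(out + proposal[:i])
            if guess != expected:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every guess was accepted
    return out[len(prompt):]


# Toy stand-ins: the target recites TEXT; the draft is a heuristic that
# agrees everywhere except that it guesses "-" where the target says "+".
TEXT = "def add ( a , b ) : return a + b".split()


def target(prefix: List[str]) -> str:
    return TEXT[len(prefix) % len(TEXT)]


def draft(prefix: List[str]) -> str:
    token = TEXT[len(prefix) % len(TEXT)]
    return "-" if token == "+" else token


if __name__ == "__main__":
    print(" ".join(speculative_decode(target, draft, ["def"], max_new=10)))
```

In this greedy form the output always matches what the target alone would have produced. The draft only changes how many tokens are accepted per verification round.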
5. Coding Agents as Software Primitives (Rick Blelock, Agentuity)
- The Shift: Coding agents are evolving from "orchestration theater" (brittle workflows) to universal software primitives.
- Real-World Application: Non-technical users are successfully running businesses (e.g., window cleaning, manufacturing) by using coding agents to manage their own workflows, proving that agents are becoming the software itself.
6. Context Engineering & Knowledge Graphs (Nia Mlin, Neo4j)
- The "Fractured Context" Problem: Vector search captures semantic meaning but misses structural relationships.
- The Solution: Context Graphs combine vector search with knowledge graphs to provide a "decision trace."
- Research Finding: A study (Jang et al., 2026) showed that combining RAG with Knowledge Graphs raised accuracy to 91%, up from 54% with RAG alone.
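A toy version of the vector-plus-graph retrieval described above: similarity search finds the semantically closest document, then a one-hop graph expansion attaches the relationships that plain vector search would miss. This is not Neo4j's API. The documents, 2-D "embeddings", relation names, and function names are all invented for illustration.

```python
import math

# Hypothetical corpus: id -> (toy 2-D embedding, text).
DOCS = {
    "refund_policy": ([0.9, 0.1], "Refunds are allowed within 30 days."),
    "exception_q3": ([0.2, 0.8], "Q3 exception: enterprise refunds up to 90 days."),
    "pricing": ([0.5, 0.5], "Enterprise pricing tiers."),
}

# Hypothetical knowledge graph: id -> list of (relation, neighbor id).
EDGES = {
    "refund_policy": [("AMENDED_BY", "exception_q3")],
    "exception_q3": [("APPROVED_BY", "vp_finance")],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def vector_search(query_vec, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d][0]), reverse=True)
    return ranked[:k]


def retrieve_with_graph(query_vec, k=1):
    """Vector hits plus a one-hop trace of (source, relation, neighbor) edges."""
    hits = vector_search(query_vec, k)
    trace = [(doc_id, relation, neighbor)
             for doc_id in hits
             for relation, neighbor in EDGES.get(doc_id, [])]
    return hits, trace


if __name__ == "__main__":
    hits, trace = retrieve_with_graph([1.0, 0.0])
    print("hits:", hits)
    print("trace:", trace)
```

With the query vector `[1.0, 0.0]`, vector search alone returns only `refund_policy`. The graph hop also surfaces that the policy was amended by `exception_q3`, which has low text similarity to the query.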
7. Engineering Zero-Shot Compliments (Lena Hall, Akamai)
- Robotics Framework: A 5-layer architecture (Physical, Media, Orchestration, Tool/Motion, Personality) for a robot.
- Key Principle: Model picks intent, runtime picks action. The model should never have direct motor access; it must issue tool calls that the runtime validates.
- Latency as Design: Latency is not just a metric; it is part of the interaction design. Hesitation in a robot is perceived as confusion, not just a slow system.
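The "model picks intent, runtime picks action" principle can be sketched as a validation gate between the model's tool call and anything that moves. This is a minimal sketch under stated assumptions: the tool names, argument limits, and the `ToolCall`, `validate`, and `execute` names are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """What the model emits: an intent name plus arguments, never motor commands."""
    name: str
    args: dict


# Hypothetical allow-list owned by the runtime: tool -> {argument: (min, max)}.
ALLOWED = {
    "wave": {"speed": (0.1, 1.0)},
    "turn_head": {"degrees": (-45.0, 45.0)},
}


def validate(call: ToolCall) -> tuple[bool, str]:
    """Check a tool call against the allow-list before any actuation."""
    limits = ALLOWED.get(call.name)
    if limits is None:
        return False, f"unknown tool: {call.name}"
    if set(call.args) != set(limits):
        return False, f"unexpected arguments for {call.name}"
    for arg, (low, high) in limits.items():
        value = call.args[arg]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not low <= value <= high:
            return False, f"{arg} out of range"
    return True, "ok"


def execute(call: ToolCall) -> str:
    """Runtime entry point: only validated calls reach the (stubbed) motion layer."""
    ok, reason = validate(call)
    if not ok:
        return f"rejected: {reason}"
    return f"executing {call.name} with {call.args}"


if __name__ == "__main__":
    print(execute(ToolCall("wave", {"speed": 0.5})))
    print(execute(ToolCall("turn_head", {"degrees": 90})))
    print(execute(ToolCall("set_motor_voltage", {"volts": 12})))
```

The design choice is that the allow-list lives in the runtime, not the prompt, so a confused or manipulated model can at worst request a rejected action.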
8. The Future of IDEs (David, Cursor)
- The Death of Traditional IDEs: The era of configuring 500+ lines of settings (like in VS Code or Emacs) is ending.
- The New IDE: Needs to be a "white canvas" that supports rich interfaces (video playback, browser dev tools, mermaid diagrams) rather than just text editing.
- Cursor 3.0: A ground-up rewrite designed to break away from VS Code baggage, allowing for a more malleable, agent-first interface.
9. MCP vs. CLI (Lori Voss, Arize AI)
- The Experiment: Compared the GitHub MCP server against the GitHub CLI (`gh`) across 500+ evaluation runs.
- Findings:
  - Latency/Cost: MCP was significantly slower and more expensive on complex tasks due to verbose JSON output.
  - The Reality: Agents often "cheat" by using bash/CLI even when instructed to use MCP.
- Conclusion: It is not "MCP vs. CLI," but "MCP + CLI." Use CLI for local developer workflows; use MCP for remote, proprietary, or consumer-facing applications requiring OAuth.
Synthesis/Conclusion
The conference highlighted a clear transition: we are moving past the "hype" phase of AI into a "production" phase. The main takeaways are:
- Infrastructure matters: Latency debt and compute scarcity are being solved by specialized hardware and inference optimization.
- Context is king: Better models cannot fix fractured context; structural data (Knowledge Graphs) and agentic memory are required for reliability.
- Engineering is back: As AI agents become autonomous, the "behavior runtime" (the code that manages boundaries, safety, and coordination) is becoming the most critical part of the product.