
AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!

By AI Engineer

Latency Debt, Context Engineering, Knowledge Graphs, Inference Optimization


Key Concepts

Agentic Coding: The shift from human-written code to AI-driven development where agents act as autonomous or semi-autonomous programmers.
Latency Debt: The accumulation of delays in AI inference due to larger models and increased token usage, which degrades developer experience.
Context Engineering: The systematic design of providing models with relevant information, tools, and instructions in the correct format at the right time.
Knowledge Graphs (KGs): A method of structuring data to capture relational connections (structural similarity) rather than just semantic meaning (text similarity).
Inference Optimization: Techniques like speculative decoding, disaggregated inference, and specialized sub-agents to reduce compute costs and latency.
Agentic Memory: The ability of agents to persist, curate, and update context across sessions, moving beyond stateless LLM interactions.

1. Transforming Programming Mindsets (David House, G2I)

The Adoption Model: Successful adoption of coding agents requires moving from "disempowerment" (reacting to AI) to "agency" (steering the AI).
The Framework: G2I utilizes a staged handoff framework:
1. Slash-brief: Agent acts as a Product Manager to create a brief.
2. Slash-spec: Agent acts as a Technical Architect to create a spec.
3. Slash-code/test: Agent performs TDD (Test Driven Development).
4. Slash-review: Agent verifies implementation against the spec.
Key Insight: For beginners, frameworks should constrain input to prevent errors; for experts, frameworks should amplify input.
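The staged handoff above can be sketched as a small pipeline in which each stage reads only the artifacts it is meant to see. This is a minimal illustration, not G2I's implementation: the `Agent` callable, the stage table, and the role names are assumptions made for the example, and a real setup would replace `agent` with a call to a coding agent.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a coding agent: takes a role and a prompt,
# returns the artifact that role produces.
Agent = Callable[[str, str], str]

# (artifact name, role the agent plays, artifacts it reads) -- assumed layout.
STAGES: List[Tuple[str, str, List[str]]] = [
    ("brief", "Product Manager", ["request"]),
    ("spec", "Technical Architect", ["brief"]),
    ("code", "TDD Engineer", ["spec"]),
    ("review", "Reviewer", ["spec", "code"]),  # review checks code against spec
]


def run_pipeline(request: str, agent: Agent) -> Dict[str, str]:
    """Run the four stages in order, handing each one only its declared inputs."""
    artifacts: Dict[str, str] = {"request": request}
    for name, role, inputs in STAGES:
        prompt = "\n".join(f"{key}: {artifacts[key]}" for key in inputs)
        artifacts[name] = agent(role, prompt)
    return artifacts


def echo_agent(role: str, prompt: str) -> str:
    """Toy agent that only labels its output, so the flow is visible."""
    return f"<{role} output>"


if __name__ == "__main__":
    for name, artifact in run_pipeline("Add CSV export", echo_agent).items():
        print(f"{name}: {artifact}")
```

Declaring each stage's inputs explicitly is one way to express the "constrain input" idea: a beginner-facing version could keep the table fixed, while an expert-facing version could let the user edit it.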

2. Solving Latency Debt (Sarah Chiang, Cerebras)

The Problem: While models have grown in intelligence and context window size, inference speed has stagnated (50–150 tokens/sec), creating "latency debt."
The Solution: Codex Spark, from Cerebras and OpenAI, achieves 1,200 tokens/sec by optimizing the inference stack:
- Hardware: Moving memory on-chip (distributed SRAM) to solve the "memory wall."
- Disaggregated Inference: Separating "prefill" (compute-bound) and "decode" (memory-bound) tasks onto specialized hardware.
- Model Architecture: Using Mixture of Experts (MoE) and expert activation pruning to maintain intelligence while reducing compute costs.
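To make the gap concrete, the arithmetic behind "latency debt" can be worked through for a single agent turn. The 6,000-token response size is an assumption chosen for illustration; the throughput figures are the ones quoted above.

```python
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream `tokens` output tokens at a steady decode rate."""
    return tokens / tokens_per_sec


# Assumed size of one agent turn (not a figure from the talk).
RESPONSE_TOKENS = 6_000

RATES = [
    ("typical, low end", 50),
    ("typical, high end", 150),
    ("Codex Spark", 1_200),
]

if __name__ == "__main__":
    for label, rate in RATES:
        secs = generation_seconds(RESPONSE_TOKENS, rate)
        print(f"{label:<18} {rate:>5} tok/s -> {secs:6.1f} s")
```

Under that assumption a turn takes two minutes at 50 tokens/sec, 40 seconds at 150 tokens/sec, and 5 seconds at 1,200 tokens/sec. The estimate covers decode time only and ignores prefill, queueing, and network overhead.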

3. Ambient Generative AI (Le Kalinowski, Kalstack)

Methodology: Deploying latent diffusion models directly on mobile Neural Processing Units (NPUs).
Innovation: Bypassing the text-to-embedding pipeline by using raw ambient sensor data (e.g., light sensors, accelerometers) to drive latent updates.
Performance: Achieved ~600ms latency with zero cloud API calls, proving that complex generative tasks can run locally on mobile hardware.

4. Sub-Agents and Specialized Models (Tis, Morph)

Software 3.5: The paradigm of "agents prompting other agents."
Specialization: General frontier models are expensive. Specialized models (e.g., for code search or context compaction) can perform specific tasks faster and cheaper.
Inference Optimization: Uses speculative decoding (heuristic-based guesses verified by a larger model) and kernel optimization to maintain speed.
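A minimal sketch of the draft-and-verify loop behind speculative decoding follows, using greedy decoding and toy "models" that are plain functions. It is not Morph's implementation: `draft`, `target`, and the sentence data are hypothetical, and a real system would verify all drafted positions in one batched forward pass of the larger model rather than one call per token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[str]], str]


def speculative_decode(prompt: List[str], draft: NextToken, target: NextToken,
                       k: int = 4, max_new: int = 8) -> List[str]:
    """Greedy speculative decoding: the output matches target-only decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap drafter guesses up to k tokens ahead.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the guesses in order and keeps the matching
        #    prefix; at the first mismatch it substitutes its own token.
        accepted: List[str] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break
        out.extend(accepted)
    return out


# Toy models (assumptions for the demo): the target recites a fixed sentence,
# and the drafter agrees with it except at every third position.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]


def target(prefix: List[str]) -> str:
    return SENTENCE[len(prefix) % len(SENTENCE)]


def draft(prefix: List[str]) -> str:
    return "dog" if len(prefix) % 3 == 1 else target(prefix)


if __name__ == "__main__":
    print(speculative_decode([], draft, target, k=4, max_new=6))
```

The speedup comes from the accepted runs: whenever the drafter guesses several tokens correctly, the larger model confirms them in one verification step instead of generating them one at a time.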

5. Coding Agents as Software Primitives (Rick Blalock, Agentuity)

The Shift: Coding agents are evolving from "orchestration theater" (brittle workflows) to universal software primitives.
Real-World Application: Non-technical users are successfully running businesses (e.g., window cleaning, manufacturing) by using coding agents to manage their own workflows, proving that agents are becoming the software itself.

6. Context Engineering & Knowledge Graphs (Nia Mlin, Neo4j)

The "Fractured Context" Problem: Vector search captures semantic meaning but misses structural relationships.
The Solution: Context Graphs combine vector search with knowledge graphs to provide a "decision trace."
Research Finding: A study (Jang et al., 2026) showed that combining RAG with Knowledge Graphs raised accuracy to 91%, compared with 54% for RAG alone.
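The idea of pairing vector search with a graph can be sketched in plain Python: similarity search finds the entry point, and graph traversal recovers the relationships that embeddings alone miss. This is an illustrative toy, not Neo4j's API; the document ids, embeddings, and edge labels are all invented for the example.

```python
import math
from typing import Dict, List, Tuple

# Toy embeddings keyed by node id (hypothetical data).
EMBEDDINGS: Dict[str, List[float]] = {
    "incident-42": [0.9, 0.1, 0.0],
    "runbook-db": [0.1, 0.9, 0.0],
    "team-payments": [0.0, 0.1, 0.9],
}

# Toy knowledge graph as (source, relation, destination) triples.
EDGES: List[Tuple[str, str, str]] = [
    ("incident-42", "CAUSED_BY", "deploy-117"),
    ("deploy-117", "APPROVED_BY", "team-payments"),
    ("runbook-db", "OWNED_BY", "team-payments"),
]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def vector_search(query: List[float], top_k: int = 1) -> List[str]:
    """Semantic step: rank nodes by cosine similarity to the query."""
    ranked = sorted(EMBEDDINGS, key=lambda n: cosine(query, EMBEDDINGS[n]), reverse=True)
    return ranked[:top_k]


def expand(seeds: List[str], hops: int = 2) -> List[Tuple[str, str, str]]:
    """Structural step: follow outgoing edges from the seeds, recording each one."""
    trace, frontier, seen = [], list(seeds), set(seeds)
    for _ in range(hops):
        next_frontier = []
        for src, rel, dst in EDGES:
            if src in frontier:
                trace.append((src, rel, dst))
                if dst not in seen:
                    seen.add(dst)
                    next_frontier.append(dst)
        frontier = next_frontier
    return trace


if __name__ == "__main__":
    hits = vector_search([1.0, 0.0, 0.0])
    for src, rel, dst in expand(hits):
        print(f"{src} -[{rel}]-> {dst}")
```

In this toy, the query lands on the incident by similarity, and the two-hop expansion surfaces the deploy and the approving team, which is the kind of path a "decision trace" is meant to expose.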

7. Engineering Zero-Shot Compliments (Lena Hall, Akamai)

Robotics Framework: A 5-layer architecture (Physical, Media, Orchestration, Tool/Motion, Personality) for a robot.
Key Principle: Model picks intent, runtime picks action. The model should never have direct motor access; it must issue tool calls that the runtime validates.
Latency as Design: Latency is not just a metric; it is part of the interaction design. Hesitation in a robot is perceived as confusion, not just a slow system.
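The "model picks intent, runtime picks action" principle can be sketched as a validation layer between the model's tool call and the motors. The tool names, argument ranges, and `motor_log` stand-in below are hypothetical, chosen only to show the shape of the check; they are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToolCall:
    """What the model is allowed to emit: a named intent plus arguments."""
    name: str
    args: Dict[str, float]


# Hypothetical allowlist: tool name -> {argument: (min, max)}.
ALLOWED: Dict[str, Dict[str, Tuple[float, float]]] = {
    "wave": {"speed": (0.1, 1.0)},
    "turn_head": {"degrees": (-45.0, 45.0)},
}


def validate(call: ToolCall) -> List[str]:
    """Return a list of problems; an empty list means the call may run."""
    if call.name not in ALLOWED:
        return [f"unknown tool {call.name!r}"]
    schema, errors = ALLOWED[call.name], []
    for arg in call.args:
        if arg not in schema:
            errors.append(f"unexpected argument {arg!r}")
    for arg, (low, high) in schema.items():
        value = call.args.get(arg)
        if not isinstance(value, (int, float)):
            errors.append(f"missing or non-numeric {arg!r}")
        elif not low <= value <= high:
            errors.append(f"{arg}={value} outside [{low}, {high}]")
    return errors


def execute(call: ToolCall, motor_log: List[Tuple[str, Dict[str, float]]]) -> str:
    """Runtime decides: only validated calls reach the (simulated) motors."""
    errors = validate(call)
    if errors:
        return "rejected: " + "; ".join(errors)
    motor_log.append((call.name, dict(call.args)))  # stand-in for a motor command
    return "ok"


if __name__ == "__main__":
    log: List[Tuple[str, Dict[str, float]]] = []
    print(execute(ToolCall("wave", {"speed": 0.5}), log))
    print(execute(ToolCall("turn_head", {"degrees": 90.0}), log))
    print(execute(ToolCall("set_motor_pwm", {"pin": 3}), log))
```

Keeping the allowlist in the runtime rather than in the prompt means a hallucinated or out-of-range call is rejected by code, regardless of what the model was told.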

8. The Future of IDEs (David, Cursor)

The Death of Traditional IDEs: The era of configuring 500+ lines of settings (like in VS Code or Emacs) is ending.
The New IDE: Needs to be a "white canvas" that supports rich interfaces (video playback, browser dev tools, mermaid diagrams) rather than just text editing.
Cursor 3.0: A ground-up rewrite designed to break away from VS Code baggage, allowing for a more malleable, agent-first interface.

9. MCP vs. CLI (Laurie Voss, Arize AI)

The Experiment: Compared the GitHub MCP server with the GitHub CLI (gh) across 500+ evaluation runs.
Findings:
- Latency/Cost: MCP was significantly slower and more expensive on complex tasks due to verbose JSON output.
- The Reality: Agents often "cheat" by using bash/CLI even when instructed to use MCP.
- Conclusion: It is not "MCP vs. CLI," but "MCP + CLI." Use CLI for local developer workflows; use MCP for remote, proprietary, or consumer-facing applications requiring OAuth.
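The "MCP + CLI" conclusion can be written down as a simple routing rule. This is one possible reading of the guidance above, not a rule stated in the talk; the `Task` fields and the `pick_interface` function are hypothetical names introduced for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Properties of a tool-using task that bear on the interface choice."""
    runs_locally: bool
    needs_oauth: bool = False
    proprietary_api: bool = False
    consumer_facing: bool = False


def pick_interface(task: Task) -> str:
    """Return "mcp" for remote, proprietary, or OAuth-gated work, else "cli"."""
    needs_mcp = (
        not task.runs_locally
        or task.needs_oauth
        or task.proprietary_api
        or task.consumer_facing
    )
    return "mcp" if needs_mcp else "cli"


if __name__ == "__main__":
    print(pick_interface(Task(runs_locally=True)))                     # local dev workflow
    print(pick_interface(Task(runs_locally=False, needs_oauth=True)))  # hosted, OAuth-gated
```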

Synthesis/Conclusion

The conference highlighted a clear transition: we are moving past the "hype" phase of AI into a "production" phase. The main takeaways are:

Infrastructure matters: Latency debt and compute scarcity are being solved by specialized hardware and inference optimization.
Context is king: Better models cannot fix fractured context; structural data (Knowledge Graphs) and agentic memory are required for reliability.
Engineering is back: As AI agents become autonomous, the "behavior runtime"—the code that manages boundaries, safety, and coordination—is becoming the most critical part of the product.
