
Fast Models Need Slow Developers — Sarah Chieng, Cerebras

By AI Engineer

AI Code Generation, AI Inference Hardware, LLM Developer Workflows


Key Concepts

Codex Spark: An AI coding model that generates about 1,200 tokens per second, which the talk puts at roughly 20x faster than widely used models such as Sonnet or Opus.
Inference Stack: The hardware and software layers (Hardware, Model Architecture, Inference Optimization) that facilitate AI code generation.
Memory Wall: The bottleneck caused by moving model weights and KV cache values between memory and the processor, accounting for 50–80% of inference latency.
Disaggregated Inference: A strategy splitting "prefill" (compute-bound) and "decode" (memory-bound) tasks across different hardware optimized for each specific function.
KV Cache Reuse: Storing previously computed token representations to avoid redundant attention calculations.
Technical Debt: The accumulation of unverified, low-quality code generated by AI agents, which accelerates significantly with faster inference speeds.
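
The memory wall can be made concrete with a back-of-the-envelope estimate. The sketch below assumes single-stream decoding in which every generated token has to stream all active weights from memory once, so memory bandwidth alone caps the token rate. The function name and all numbers are illustrative assumptions, not figures from the talk.

```python
def decode_tokens_per_sec(active_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on decode speed when each token reads all active weights once.

    Ignores KV cache traffic, batching, and compute time, so real systems
    land below this bound.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token


# Hypothetical model: 10B active parameters stored in 16-bit precision.
ACTIVE_PARAMS = 10e9
BYTES_PER_PARAM = 2

# Hypothetical memory systems: 2 TB/s off-chip versus 10x that bandwidth.
slow = decode_tokens_per_sec(ACTIVE_PARAMS, BYTES_PER_PARAM, 2e12)
fast = decode_tokens_per_sec(ACTIVE_PARAMS, BYTES_PER_PARAM, 2e13)
print(f"bandwidth-bound ceiling: {slow:.0f} vs {fast:.0f} tokens/sec")
```

Under these assumptions the ceiling scales linearly with bandwidth, which is why moving weights closer to the compute units translates directly into faster decoding.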

1. The New Era of Fast Inference

Sarah Chieng, Head of Developer Experience at Cerebras, points out that while AI models have become smarter and gained larger context windows, code-generation speeds have stayed flat at 50–150 tokens/sec for two years. The introduction of Codex Spark (1,200 tokens/sec) marks a paradigm shift.

Why models are getting faster:

Hardware Innovation: Moving memory closer to the chip (e.g., Cerebras’s wafer-scale engine using on-chip SRAM) to bypass the memory wall.
Disaggregated Inference: Separating prefill and decode steps to use specialized hardware for each.
Model Architecture: Using Mixture of Experts (MoE) to activate only a subset of parameters per token, and REAP (Router-weighted Expert Activation Pruning) to prune experts that contribute little.
Inference Optimization: Advanced KV cache management to reduce redundant computation.
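
As an illustration of KV cache reuse, here is a toy prefix cache. It only counts how many token positions would need fresh computation when a new request shares a prefix with an earlier one. The class name and the string stand-ins for cached entries are assumptions; production inference engines store per-layer key/value tensors and usually cache in fixed-size blocks.

```python
from typing import Dict, Sequence, Tuple


class PrefixKVCache:
    """Toy cache keyed by token prefix; values stand in for KV tensors."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], str] = {}

    def prefill(self, tokens: Sequence[int]) -> int:
        """Process a prompt and return how many positions were cache misses."""
        computed = 0
        for end in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:end])
            if prefix not in self._store:
                # A real engine would run attention for this position here.
                self._store[prefix] = f"kv@{end}"
                computed += 1
        return computed


cache = PrefixKVCache()
first = cache.prefill([1, 2, 3, 4])         # cold cache: all 4 positions computed
second = cache.prefill([1, 2, 3, 4, 5, 6])  # shared prefix: only 2 new positions
print(first, second)
```

The second request pays only for the tokens after the shared prefix, which is the saving that matters in agent loops where each turn resends a long, mostly unchanged context.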

2. The "Bad Habits" of AI Development

The speaker warns that developers have picked up poor workflows while waiting on slow models, and that these habits become dangerous at 1,200 tokens/sec:

"One-shotting" massive prompts: Trying to do too much in a single request.
Agent Swarms: Running dozens of unmonitored agents simultaneously.
Lack of Verification: Generating massive amounts of code without human oversight, leading to unmanageable technical debt.

3. Practical Playbook for Fast Inference

To thrive in this new regime, developers must shift from "waiting for the model" to "collaborating with the model."

A. Orchestration and Model Selection

Tiered Intelligence: Use larger, "smarter" models (e.g., GPT-5.4) for high-level planning and long-horizon workflows. Use faster models (Codex Spark) as the "executor" for specific tasks.
Skill Capture: Use a smart model to solve a complex task once, capture the trajectory as a "skill," and then use a fast model to repeat that workflow reliably.
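
A minimal sketch of tiered intelligence with skill capture, assuming the planner and executor are plain callables that wrap whichever slow and fast models are in use. The function name, the dictionary used as a skill store, and the step format are all hypothetical.

```python
from typing import Callable, Dict, List

Planner = Callable[[str], List[str]]   # slow, "smart" model: task kind -> steps
Executor = Callable[[str, str], str]   # fast model: (step, task) -> result


def run_with_skill(skills: Dict[str, List[str]],
                   kind: str,
                   task: str,
                   planner: Planner,
                   executor: Executor) -> List[str]:
    """Plan once per task kind, then replay the captured steps with the fast model."""
    if kind not in skills:
        # Only the first task of this kind pays for the slow planner.
        skills[kind] = planner(kind)
    return [executor(step, task) for step in skills[kind]]


def demo_planner(kind: str) -> List[str]:
    # Stand-in for a call to a larger planning model.
    return ["write failing test", "implement", "run tests"]


def demo_executor(step: str, task: str) -> str:
    # Stand-in for a call to a fast executor model.
    return f"{task}: {step}"


skills: Dict[str, List[str]] = {}
print(run_with_skill(skills, "add-endpoint", "GET /users", demo_planner, demo_executor))
print(run_with_skill(skills, "add-endpoint", "GET /orders", demo_planner, demo_executor))
```

The second call never reaches the planner: the captured steps are replayed directly by the executor.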

B. Real-Time Validation

At 1,200 tokens/sec, validation is effectively "free." Developers should integrate:

Automated Testing: Run linting, pre-commit hooks, and browser-based QA at every step.
Cherry-picking: Instead of generating one version of a feature, generate 15–75 variations across sub-agents and select the best one. This allows developers to "induce taste" into AI outputs without manual coding.
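
Cherry-picking can be sketched as a filter-then-rank step: discard every candidate that fails validation, then keep the highest-scoring survivor. In this sketch the only check is whether the candidate compiles as Python, and the score simply prefers shorter code. Both are placeholder assumptions; a real setup would plug in tests, linters, and a human or model-based preference.

```python
from typing import Callable, Iterable, Optional, Sequence


def compiles(source: str) -> bool:
    """Cheap validation: does the candidate parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def cherry_pick(candidates: Iterable[str],
                checks: Sequence[Callable[[str], bool]],
                score: Callable[[str], float]) -> Optional[str]:
    """Return the best-scoring candidate that passes every check, or None."""
    passing = [c for c in candidates if all(check(c) for check in checks)]
    if not passing:
        return None
    return max(passing, key=score)


variants = [
    "def f(x): return x + 1",
    "def f(x) return x + 1",                 # syntax error, filtered out
    "def f(x):\n    y = x + 1\n    return y",
]
best = cherry_pick(variants, checks=[compiles], score=lambda c: -len(c))
print(best)
```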

C. Active Steering

The "Pair Programmer" Mindset: Stop treating AI as a "set it and forget it" tool. Sit with the model, steer it, and maintain control.
Constraint Setting: Be specific. Ban the model from deleting files, set a max_diff_size, and provide iterative feedback (e.g., "Don't touch types yet," "Redo that implementation").
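
Constraints like these can also be enforced mechanically before a change is accepted. The sketch below is one possible guard, assuming the agent's proposal is available as a mapping from file path to file contents. The talk names max_diff_size only as a setting, so its semantics here (total added plus removed lines) are an assumption, as are the function names.

```python
import difflib
from typing import Dict, List


def changed_lines(old: str, new: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line[:2] in ("+ ", "- "))


def check_change(old_files: Dict[str, str],
                 new_files: Dict[str, str],
                 max_diff_size: int) -> List[str]:
    """Return a list of constraint violations; an empty list means acceptable."""
    violations = []
    # Constraint 1: the agent may not delete files.
    for path in sorted(set(old_files) - set(new_files)):
        violations.append(f"deleted file: {path}")
    # Constraint 2: the total change must stay under the size budget.
    total = sum(changed_lines(old_files.get(path, ""), text)
                for path, text in new_files.items())
    if total > max_diff_size:
        violations.append(f"diff too large: {total} > {max_diff_size}")
    return violations


before = {"a.py": "x = 1\ny = 2\n", "b.py": "pass\n"}
after = {"a.py": "x = 1\ny = 3\n"}  # edits a.py and drops b.py
print(check_change(before, after, max_diff_size=1))
```

A rejected change can be fed back to the model as iterative feedback instead of being merged.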

D. Context Management

Bounded Goals: Break large tasks into small, manageable goals to avoid hitting context limits and triggering "compaction" (which can lead to data loss).
The Four-File System: Maintain persistent external memory to keep agents aligned:
1. agents.md: Defines sub-agents.
2. plan.md: The master checklist.
3. progress.md: Tracks completed vs. pending tasks.
4. verify.md: Stores validation results for every step.
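
One way to wire up the four files is shown below. The file names come from the talk, but the line formats inside each file and the helper names are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def init_memory(root: Path, steps: List[str]) -> None:
    """Create the four files, with plan.md holding one checklist line per step."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "agents.md").write_text("# Sub-agents\n")
    (root / "plan.md").write_text("".join(f"- [ ] {s}\n" for s in steps))
    (root / "progress.md").write_text("")
    (root / "verify.md").write_text("")


def record_step(root: Path, step: str, passed: bool, detail: str) -> None:
    """Log every validation result; mark the step done only if it passed."""
    status = "PASS" if passed else "FAIL"
    with (root / "verify.md").open("a") as f:
        f.write(f"- {step}: {status} ({detail})\n")
    if passed:
        with (root / "progress.md").open("a") as f:
            f.write(f"- done: {step}\n")


def pending(root: Path) -> List[str]:
    """Steps listed in plan.md that are not yet recorded in progress.md."""
    done = {line[len("- done: "):]
            for line in (root / "progress.md").read_text().splitlines()
            if line.startswith("- done: ")}
    plan = [line[len("- [ ] "):]
            for line in (root / "plan.md").read_text().splitlines()
            if line.startswith("- [ ] ")]
    return [step for step in plan if step not in done]
```

Because the state lives on disk rather than in the context window, an agent can reread plan.md and progress.md after compaction and resume from the first pending step.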

4. Synthesis and Conclusion

The transition to high-speed inference is not just about raw performance; it changes the developer experience itself. By dropping "sloppy" habits such as running unverified agent swarms, and adopting a structured, collaborative, verification-heavy workflow, developers can use Codex Spark to ship higher-quality code more quickly. The goal is real-time collaboration in which the human remains the architect and the AI acts as a high-speed, verifiable executor.
