Vision: Zero Bugs — Johann Schleier-Smith, Temporal
By AI Engineer
Key Concepts
- Zero Bugs Vision: The aspiration for software with no defects.
- Durable Execution: An execution model (Temporal's focus) in which program state is persisted so that workflows survive crashes, restarts, and cloud outages.
- N-Version Programming: Using multiple independent implementations of the same software on different hardware/OS to ensure reliability.
- Specification-Based Design: Designing software based on detailed, analyzable specifications to guarantee behavior.
- Independent Verification Teams: Separating code writing from code checking to ensure objectivity.
- Defensive Programming: Employing techniques to handle errors gracefully and prevent unexpected behavior.
- Static Analysis and Verification: Automated techniques to detect bugs and prove code correctness without execution.
- High-Level Languages: Programming languages that abstract away machine-specific details, improving productivity and readability.
- Structured Programming: A programming paradigm that uses basic control structures (sequence, selection, iteration) to manage complexity.
- Modularity: Designing software as independent, reusable components (modules) to manage complexity and facilitate verification.
- Formal Methods: Mathematical techniques used to prove the correctness of software and hardware designs.
- Theorem Proving: A formal method to prove mathematical statements about software properties.
- Model Checking: A formal method to verify finite-state systems by exploring all possible states.
- Agentic Coding: Software development using AI agents (LLMs) to generate code.
- Software 3.0: The concept of programming through AI, where prompts function as programs.
- High Assurance Software: Software designed and verified to meet extremely high standards of reliability and safety.
The Vision of Zero Bugs and Its Objections
The presentation begins by contrasting the everyday experience of most users, where popular applications generally work well, with the reality faced by software engineers: constant stress due to the possibility of errors in critical systems, on-call responses, and cloud outages. This disconnect highlights the pervasive nature of software bugs, even in seemingly minor incidents like a mini-golf reservation glitch.
The core argument is a push towards a vision of zero bugs, aiming for software with literally no defects. However, several objections are raised and addressed:
- Incidents Happen: The world is imperfect, and software failures are inevitable due to cloud outages, order problems, or unforeseen real-world events. The argument is made that software is "good enough" in many cases.
- Impossibility of Elimination:
- Complexity: Millions of lines of code, exacerbated by AI-generated code, make complete bug elimination seem impossible.
- Ambiguity: The definition of a bug itself is tied to user expectations, which can be ambiguous. Specifications may also have inherent ambiguities.
- Unforeseen Real-World Events: Control systems, like self-driving vehicles, struggle with modeling all possible real-world scenarios.
- Theoretical Limits: Even powerful verification techniques have computational limits and can be intractable.
- Economics:
- Competitive Pressure: Competitors who prioritize speed over quality may win in the marketplace.
- ROI: The return on investment for fixing every single bug may not be justifiable, especially for minor issues with workarounds.
- Cynical View: Some companies might intentionally ship buggy software to sell support services.
Hope: Learning from High Assurance Industries
Despite the objections, the speaker contends that there is hope and presents practices that enable highly reliable software, drawing heavily from the aerospace industry.
Case Study: Airbus A320
The control software for the Airbus A320, developed in the 1980s, is highlighted as a showcase for reliability, with no serious incidents attributed to its software to date. Their approach involved:
- N-Version Programming: Critical elements used different processors (e.g., x86, Motorola) and operating systems, with separate teams developing the software, providing significant redundancy.
- Specification-Based Design: Extensive documentation allowed for analyzable and provable guarantees about system behavior under various scenarios.
- Independent Verification Teams: Code writers and verifiers were distinct groups.
- Defensive Programming Techniques:
- No memory allocation at runtime (all done statically).
- Simple, explicit error handling rather than sophisticated exception handling.
- Static Analysis and Verification: Techniques to analyze code without execution.
The mindset was "zero defect tolerance," treating software as a certified component, akin to a mechanical part. A system-level approach to reliability was adopted, recognizing the interconnectedness of potential failures.
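The N-version idea can be sketched in miniature: run independently developed implementations of the same function and accept only a majority answer. This is an illustrative sketch only (the A320 used separate hardware, operating systems, and teams, not a single process), and the three implementations here are hypothetical stand-ins.

```python
from collections import Counter

def majority_vote(implementations, *args):
    """Run independently developed implementations of the same
    function and return the majority answer, flagging disagreement."""
    results = [impl(*args) for impl in implementations]
    answer, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError(f"No majority among results: {results}")
    return answer

# Three independently written absolute-value routines (stand-ins for
# versions built by separate teams on separate hardware/OS stacks).
impl_a = lambda x: x if x >= 0 else -x
impl_b = lambda x: max(x, -x)
impl_c = lambda x: (x * x) ** 0.5

print(majority_vote([impl_a, impl_b, impl_c], -3))  # → 3
```

If one version harbors a bug, the other two outvote it at runtime; if no majority exists, the system fails loudly rather than silently.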
Quality Through Process in Aerospace
The aerospace industry emphasizes quality through process, a concept crucial for agentic coding as well. Key steps include:
- Planning and Requirements
- Certification by external agencies (regulators, government)
- Integration testing with physical systems (especially relevant for future software interfacing with the physical world)
- A feedback process for refining each step.
Other Aerospace Examples:
- Space Shuttle: In its last three versions, with 420,000 lines of code each, there was only one error per version. Over 11 versions, a total of 17 errors were found, representing approximately 1,000 times fewer bugs per line of code than typical commercial software. No space shuttles were lost due to software problems.
- Curiosity Rover: Developed in the 2000s, this mission required high reliability due to the cost and inability to intervene on Mars. It used redundant, identical systems and a commercial off-the-shelf real-time operating system, showing an evolution in reliable systems.
High assurance software is also critical in other industries like chemical, automotive, medical, nuclear power, and security systems.
Advances in Computer Science Foundations for Reliable Software
Several foundational advances in computer science have paved the way for building reliable software:
High-Level Languages (1950s-1980s)
- Productivity Gain: Approximately 5-10x increase compared to assembly language.
- Abstraction:
- Data Abstraction: Working with data structures relevant to the problem domain instead of raw memory locations.
- Structured Programming: Organizing code logically.
- Preserving Essential Complexity: Focusing on problem-relevant aspects and removing implementation-specific details (registers, memory layout, machine performance).
Structured Programming (1960s-1970s)
- Key Idea: Using basic control structures:
- Sequences: Statements executed in order.
- Selection: If-then-else logic.
- Iteration: Loops.
- Benefits: Enabled compositional reasoning and eliminated "spaghetti code" (code with excessive GOTO statements and unstructured jumps).
- Mitigates Complexity: Hierarchical decomposition allows programmers to focus on individual code pieces. This remains valuable for LLM-generated code.
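To illustrate why these control structures enable compositional reasoning, here is a small hypothetical example built only from sequence, selection, and iteration — each piece can be understood (and checked) on its own, with no jumps between them:

```python
def is_valid(record):
    """Selection lives in one small, named place."""
    return record.get("status") == "ok"

def count_valid(records):
    """Iteration with no unstructured jumps: the loop body can be
    reasoned about locally (e.g., count never exceeds len(records))."""
    count = 0
    for record in records:      # iteration
        if is_valid(record):    # selection
            count += 1          # sequence
    return count

records = [{"status": "ok"}, {"status": "error"}, {"status": "ok"}]
print(count_valid(records))  # → 2
```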
Modularity (1970s onwards)
- Concept: Designing software as independent, reusable modules.
- Examples: Libraries, object-oriented programming.
- Verification Boost: Modularity allows for local reasoning at each level, leading to sub-exponential or even linear scaling in verification complexity, rather than exponential.
- Manageable Complexity: Regardless of system size, complexity remains manageable.
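A minimal sketch of local reasoning across a module boundary (the function names and contracts here are invented for illustration): callers depend only on the module's documented contract, not its implementation, so each side can be verified separately instead of re-checking the whole system.

```python
def mean(xs: list[float]) -> float:
    """Contract: requires xs non-empty;
    ensures min(xs) <= result <= max(xs)."""
    assert len(xs) > 0, "precondition: non-empty input"
    result = sum(xs) / len(xs)
    assert min(xs) <= result <= max(xs), "postcondition violated"
    return result

def deviations(xs: list[float]) -> list[float]:
    # Local reasoning: this caller relies only on mean()'s contract.
    # If mean() is re-implemented, this function needs no re-verification.
    m = mean(xs)
    return [x - m for x in xs]

print(deviations([1.0, 2.0, 3.0]))  # → [-1.0, 0.0, 1.0]
```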
Why LLMs Don't Generate Machine Code Directly: The reasons for using high-level languages apply to LLMs too:
- Limited Context: LLMs have a finite context window, similar to human working memory.
- Library Trust: Using reliable, pre-tested libraries is more efficient than generating and verifying all code from scratch.
Formal Methods: Proving Correctness
Formal methods are mathematical techniques to prove software correctness.
Dafny Language and Demonstrations
The presentation features demos using the Dafny language, which allows embedding proofs directly within code and generating output in several languages (JavaScript, Python, C#).
- index_up function: A simple example demonstrating how to write assertions (e.g., array length > 0, returned index is valid or -1) and run the Dafny verifier.
- Verification Success: The verifier confirms no bugs, and a Python program is generated that verifies before execution.
- Bug Introduction: A small change to the algorithm introduces a bug, which the verifier immediately detects, preventing its execution.
Key Takeaway: Verification is only as good as the specification. Missing checks in the specification creates opportunities for bugs.
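This takeaway can be made concrete with a hypothetical sketch: a sorting routine that satisfies a weak specification (output is ordered) while still being wrong, until the missing requirement — that the output is a permutation of the input — is added to the spec.

```python
def buggy_sort(xs):
    # Returns a sorted list, but silently drops duplicates.
    return sorted(set(xs))

def weak_spec(inp, out):
    # Incomplete spec: only checks that the output is ordered.
    return all(a <= b for a, b in zip(out, out[1:]))

def strong_spec(inp, out):
    # Adds the missing requirement: output is a permutation of input.
    return weak_spec(inp, out) and sorted(inp) == sorted(out)

data = [3, 1, 3, 2]
out = buggy_sort(data)
print(weak_spec(data, out))    # → True: the bug slips past the weak spec
print(strong_spec(data, out))  # → False: the complete spec catches it
```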
Commercial Relevance of Formal Methods
Formal methods have become commercially relevant:
- seL4 Microkernel: A fully verified operating system for embedded and security-critical applications.
- CompCert C Compiler: A verified compiler ensuring that generated code precisely matches the C program's intent, used in security-critical and aviation industries.
- Project Everest: Focuses on verified libraries for cryptography, protecting internet traffic.
- Microprocessors: Formal methods have been used for decades to ensure the correctness of microprocessor designs.
Progress in Verification: Over the last 20+ years, there has been significant progress in the size and speed of verification, driven by benchmarks. Success rates have risen from ~30% to nearly 100%, and solver runtimes have dropped by more than a factor of 50.
Categories of Verification Tools:
- Static Verification:
- Type Systems: Basic form of static verification.
- Dafny and SPARK (a verifiable subset of Ada): Tightly couple theorems with code.
- Lean, Coq: Provide theorem proving separate from code, requiring careful alignment.
- Model Checking: Deals with finite-state machines and proving properties about them.
- Theorem Proving: More powerful reasoning techniques, not limited to finite states.
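Model checking can be illustrated with a toy explicit-state checker — a drastic simplification of real tools, but the same principle: enumerate every reachable state of a finite-state system and test an invariant in each. The traffic-light system below is a hypothetical example.

```python
from collections import deque

def model_check(initial, transitions, invariant):
    """Explicit-state model checking: BFS over every reachable
    state, verifying the invariant holds in each."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state  # counterexample found
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # invariant holds in all reachable states

# Toy system: two lights, each red or green. A light may only turn
# green when the other is red; a green light may turn red anytime.
def transitions(state):
    a, b = state
    nxt = []
    if a == "red" and b == "red":
        nxt += [("green", b), (a, "green")]
    if a == "green":
        nxt.append(("red", b))
    if b == "green":
        nxt.append((a, "red"))
    return nxt

counterexample = model_check(("red", "red"), transitions,
                             lambda s: s != ("green", "green"))
print(counterexample)  # → None: both-green is provably unreachable
```

Because the state space is finite and fully explored, a `None` result is a proof of the safety property, not just a test that happened to pass.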
Agentic Coding for High Assurance
Agentic coding, using LLMs for code generation, can be enhanced for reliability:
- Detailed Specifications: Essential for guiding LLMs.
- Type Languages: Leverage static typing for better code quality.
- Modular Code: Break down problems into smaller, manageable modules.
- Explicit Risk Analysis: Ask LLMs to perform risk analysis and write safety cases (statements about potential failures and their mitigations). This is qualitative reasoning LLMs can perform.
- Separate Teams/Prompts: Use distinct prompts for code generation and testing, mimicking separate verification teams.
- Multiple Model Providers: Employ different LLMs for code writing and test generation.
- Formal Methods Integration: Apply formal methods to critical code sections.
- Keep Code Small and Outsource: Utilize well-tested libraries for common functionalities.
Software 3.0 and New Assurance Techniques
Software 3.0, as promoted by Andrej Karpathy, treats prompts as programs: natural language becomes the programming interface, and the LLM carries out the computation. This paradigm presents new challenges:
- Non-determinism and Huge State Space: Traditional verification techniques are largely ineffective.
- New Failure Modes: LLMs have different failure modes.
- Potential for New Resilience: LLMs can handle ambiguity and respond to unanticipated inputs. Architectures can be designed to invoke LLMs for error conditions, offering new forms of protection.
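One such architecture can be sketched as follows: a deterministic fast path handles expected inputs, and the LLM is invoked only on the error path, with its answer re-validated by the strict path before use. Note that ask_llm is a hypothetical stand-in stubbed with a canned reply, not a real provider API.

```python
def parse_amount(text: str) -> float:
    """Deterministic fast path: strict, verifiable parsing."""
    return float(text.replace("$", "").strip())

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM provider; a real
    system would send `prompt` to a model and return its reply."""
    return "19.99"  # canned response for this sketch

def robust_parse_amount(text: str) -> float:
    try:
        return parse_amount(text)  # deterministic path, used almost always
    except ValueError:
        # Unanticipated input: invoke the LLM only on the error path,
        # then re-validate its answer through the strict parser.
        repaired = ask_llm(f"Extract the numeric amount from: {text!r}")
        return parse_amount(repaired)

print(robust_parse_amount("$12.50"))  # → 12.5
print(robust_parse_amount("nineteen dollars 99 cents"))  # → 19.99
```

The key design choice is that the LLM's output never reaches the caller unchecked: it must survive the same strict validation as ordinary input.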
Cost of Agentic Coding
The cost of agentic coding is explored:
- Example Game Generation: A simple game took 2 minutes of prompting, costing about $2.
- Cost Breakdown: Output tokens are only ~15% of the cost; the rest is for repeated input tokens (tests) and reasoning tokens.
- Human vs. AI Cost:
- High Assurance Code (Space Shuttle): ~$1,000/line (1990) to ~$2,500/line (2005), potentially up to $3,000/line for security-critical software.
- Typical Software Development: $10-$100/line.
- Low-Cost Contractors: $1-$10/line.
- Agentic Coding: A broad range, potentially much lower than typical software.
Key Insight: Agentic coding has the potential to produce high assurance software 100 times more cheaply than typical software is produced today, given the ~100x gap between high assurance and typical code costs.
Conclusion: The Path to Zero Bugs
The presentation concludes by reiterating that software reliability is a solved problem in industries like aerospace. With agents geared towards high assurance (using formal methods, extensive processes, adversarial testing), we can expect:
- 100x cheaper high assurance code.
- A proliferation of bug-free experiences.
This push towards zero bugs also addresses limitations in current agentic coding, particularly the quality of generated software. When agentic coding consistently produces fewer defects than human-written code, its adoption will accelerate. The knowledge and techniques to achieve this have existed for decades.
The presentation ends with a lighthearted note about tardigrades (Temporal's mascot, Ziggy) not being bugs, emphasizing the company's focus on building durable execution for reliable modern software.