Claude, GPT-5, Gemini 3: Only 1 of 52 Jobs They Can Actually Do

By The AI Automators

Key Concepts

  • Delegate 52: A benchmark designed to evaluate LLM performance on long-horizon, multi-turn document editing tasks across 52 professional domains.
  • Document Corruption: The tendency of LLMs to silently degrade or alter document content during iterative editing.
  • Catastrophic Failure: Sudden, significant loss of data (10–30% in a single turn) that occurs even in high-performing models.
  • Agentic Harness: A system wrapper that provides LLMs with tools (read, write, code execution) to perform tasks.
  • Surgical Editing: A methodology involving targeted string replacement rather than full document regeneration.
  • Context Rot/Distraction: The degradation of model performance caused by bloated context windows or irrelevant information.
  • Human-in-the-Loop (HITL): The necessity of human oversight for high-stakes document delegation until reliability reaches near-perfect levels.

1. The Delegate 52 Benchmark

Microsoft researchers developed Delegate 52 to address a gap in AI research, which often focuses on coding rather than on the "billions of dollars" of knowledge work performed in documents, spreadsheets, and slide decks.

  • Scope: 310 work environments across 52 domains (e.g., aviation, accountancy, robotics).
  • Methodology: Each environment uses real documents and 5–10 complex editing tasks. The test involves "round trips": a forward edit (e.g., splitting a ledger) followed by a backward edit (reverting to the original).
  • Goal: After a series of round trips, the final document should be identical to the source.
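The round-trip idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scoring: the `similarity` metric (Python's `difflib` ratio) and the toy `forward`/`backward` functions are assumptions standing in for real LLM edits and for whatever measure the researchers used.

```python
import difflib


def similarity(source, final):
    """Return a 0..1 ratio between two documents (1.0 means identical).

    Stand-in metric for illustration; the benchmark's own scoring may differ.
    """
    return difflib.SequenceMatcher(None, source, final, autojunk=False).ratio()


def run_round_trips(source, forward, backward, trips):
    """Apply forward then backward edits `trips` times, scoring after each trip."""
    doc = source
    scores = []
    for _ in range(trips):
        doc = backward(forward(doc))
        scores.append(similarity(source, doc))
    return scores


def reverse_lines(doc):
    """Toy lossless edit: reverse line order (it is its own inverse)."""
    return "\n".join(reversed(doc.split("\n")))


def lossy_backward(doc):
    """Toy lossy edit: restore line order but silently drop the last line."""
    return "\n".join(reverse_lines(doc).split("\n")[:-1])


source = "a\nb\nc\nd\ne"
lossless_scores = run_round_trips(source, reverse_lines, reverse_lines, trips=3)
lossy_scores = run_round_trips(source, reverse_lines, lossy_backward, trips=3)
print(lossless_scores)
print(lossy_scores)
```

A perfectly reversible pair of edits keeps the score at 1.0 on every trip, while the lossy pair drifts further from the source each time, which is the compounding behavior the benchmark is designed to expose.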

2. Research Findings and Data

The study evaluated 19 LLMs from six families (OpenAI, Anthropic, Google, Mistral, xAI, Moonshot).

  • Performance: Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt an average of 25% of document content over long workflows.
  • Failure Patterns: 80% of total degradation stems from "critical failures," the sudden single-turn losses described under Key Concepts. Stronger models do not avoid these; they simply delay them.
  • Domain Variance: Models perform better in programmatic domains (Python) and worse in natural language or niche domains (music notation, earnings statements).
  • Reliability Threshold: The researchers define "ready for delegation" as achieving a 98% accuracy score after 20 interactions. Currently, only the Python domain meets this criterion.
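To get a feel for how demanding the 98%-after-20-interactions threshold is, here is a rough arithmetic sketch. It assumes, purely for illustration, that degradation is spread evenly across interactions; the findings above indicate real failures are sparse and severe, so this is a simplification. The function names are hypothetical.

```python
def ready_for_delegation(scores, threshold=0.98, interactions=20):
    """True if a score exists for the given interaction and meets the threshold.

    `scores[i]` is assumed to be the document accuracy (0..1) after
    interaction i + 1.
    """
    if len(scores) < interactions:
        return False
    return scores[interactions - 1] >= threshold


def per_step_retention(final_fraction, interactions):
    """Constant per-interaction retention that compounds to `final_fraction`.

    Assumes uniform degradation, which is an illustrative simplification.
    """
    return final_fraction ** (1.0 / interactions)


needed = per_step_retention(0.98, 20)
print(f"per-interaction retention needed: {needed:.4%}")
```

Under that even-decay assumption, each interaction would need to preserve roughly 99.9% of the document to finish at 98% after 20 interactions.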

3. The "Tool Use" Paradox

The researchers tested an agentic harness (providing tools like read/write/delete/Python). Counterintuitively, models performed 6% worse with tools than without.

  • Reasons for Failure:
    • Overhead: Tool use increases input tokens by 2–5x, exacerbating context window issues.
    • Lack of Surgical Tools: The harness lacked a dedicated "edit" tool, forcing models to choose between full file rewrites (prone to error) or complex Python scripts.
    • Inconsistent Triggering: Models often favored file-writing over code execution, leading to higher corruption rates.

4. Actionable Frameworks for AI Architects

To build reliable document-editing systems, the following design patterns are recommended:

  • Separate Edit from Write: Implement two distinct tools. The edit tool should perform targeted string replacements (requiring exact character/whitespace matches), while write should be reserved for full rewrites and discouraged via system prompts.
  • Enforce "Read Before Edit": Prevent the model from editing files it has not explicitly read in the current session to avoid operating on stale memory.
  • Opinionated Harnesses: Do not rely on raw LLM capabilities. Enforce workflows (like those in Claude Code) that require checkpointing, exact string matching, and state rewinding.
  • Context Management:
    • Use reranking in RAG systems to minimize distractor context.
    • Implement context resets or sub-agents to handle large documents, as long-running context windows inevitably lead to "context rot."
  • Human-in-the-Loop (HITL): Until models reach 98%+ reliability, implement UI gates that display a "diff" between the original and the proposed changes for human verification.
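Several of these patterns (a separate exact-match edit tool, read-before-edit, and a diff gate for human approval) can be combined in one small harness. The following is a minimal sketch under stated assumptions: `DocumentSession`, `EditError`, and every method name are hypothetical and do not reproduce the API of Claude Code or any other product, and files are held in an in-memory dict rather than on disk.

```python
import difflib


class EditError(Exception):
    """Raised when a proposed edit violates the harness rules."""


class DocumentSession:
    """Hypothetical harness: exact-match edits, read-before-edit, diff gate."""

    def __init__(self, files):
        self._files = dict(files)  # path -> current text
        self._read = set()         # paths read in this session

    def read(self, path):
        self._read.add(path)
        return self._files[path]

    def propose_edit(self, path, old, new):
        """Return the edited text without applying it."""
        if path not in self._read:
            raise EditError(f"{path} must be read before it is edited")
        if not old:
            raise EditError("the text to replace must not be empty")
        text = self._files[path]
        matches = text.count(old)
        if matches != 1:
            # Exact character/whitespace match required, and it must be unique.
            raise EditError(f"expected exactly 1 match, found {matches}")
        return text.replace(old, new, 1)

    def diff(self, path, proposed):
        """Unified diff of current vs. proposed text, for human review."""
        return "".join(difflib.unified_diff(
            self._files[path].splitlines(keepends=True),
            proposed.splitlines(keepends=True),
            fromfile=path,
            tofile=path + " (proposed)",
        ))

    def apply(self, path, proposed, approved):
        """Commit the proposed text only after explicit human approval."""
        if not approved:
            raise EditError("edit was not approved by a human reviewer")
        self._files[path] = proposed


session = DocumentSession({"ledger.txt": "rent: 100\nfood: 50\n"})
session.read("ledger.txt")
proposed = session.propose_edit("ledger.txt", "food: 50", "food: 55")
review = session.diff("ledger.txt", proposed)
print(review)
session.apply("ledger.txt", proposed, approved=True)
```

Requiring a unique exact match means a model working from a stale or hallucinated view of the file fails loudly instead of silently rewriting content, and the diff gives the reviewer only the changed lines to check.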

5. Notable Quotes and Perspectives

  • "Current LLMs are unreliable delegates. They introduce sparse but severe errors that silently corrupt documents, compounding over long interactions."
  • "Stronger models don't avoid the failures, they just actually delay them."
  • "If you're delegating tasks, you need very high levels of accuracy and reliability... otherwise, you just end up babysitting the agent."

Synthesis

The Delegate 52 research highlights a critical vulnerability in current AI: the inability to maintain document integrity over long-horizon tasks. While frontier models are powerful, they are currently unsuitable for autonomous delegation in professional environments. The path forward for developers lies in harness engineering: moving away from raw, full-document regeneration toward surgical, tool-enforced editing workflows that prioritize precision, exact matching, and human oversight.
