Claude, GPT-5, Gemini 3: Only 1 of 52 Jobs They Can Actually Do
By The AI Automators
Key Concepts
- Delegate 52 Benchmark: A testing framework designed to evaluate LLM performance on long-horizon, multi-step document editing tasks.
- Long-Horizon Delegated Workflows: Complex, multi-turn tasks where an AI acts as an agent to modify professional documents over time.
- Catastrophic Single-Round Failure: A phenomenon where an LLM maintains high accuracy for several turns before suddenly causing significant document corruption in one step.
- Content Degradation: The loss or alteration of information within a document. How easy the damage is to detect depends on the model's sophistication.
The Delegate 52 Benchmark and Methodology
Microsoft researchers have introduced the Delegate 52 benchmark, a rigorous evaluation tool designed to test how Large Language Models (LLMs) handle professional document editing. Unlike single-shot tasks, this benchmark simulates real-world "long-horizon" workflows across diverse sectors, including robotics, accounting, aviation, translation, and presentation design.
- Scope: The benchmark consists of 310 distinct environments, each containing real-world documents and 5 to 10 complex editing instructions.
- Evaluation: Researchers tested 19 LLMs across six different model families to determine their reliability when acting as autonomous agents for knowledge workers.
Research Findings and Performance Metrics
The study reveals a significant gap between the commercial promise of AI-driven delegation (as seen in Microsoft’s Copilot suite) and the actual technical reliability of current LLMs.
- Degradation Rates: Even "frontier" models (e.g., Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) exhibit an average document corruption rate of 25% over the course of long workflows. Across all 19 models tested, the average degradation rate reached 50%.
- The "Catastrophic Failure" Pattern: The research highlights a deceptive pattern in model behavior. Models often maintain near-perfect reconstruction for several rounds, creating a false sense of security. This is followed by sudden, catastrophic failures where 10% to 30% of the document content is corrupted in a single turn.
- Domain Sensitivity: Performance is highly dependent on the domain:
  - High Performance: Programmatic domains, such as Python code or database management.
  - Low Performance: Natural language tasks and niche domains, such as financial earnings statements or music notation.
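The degradation rate and the "catastrophic failure" pattern described above can be illustrated with a small sketch. It assumes each round of a workflow produces a score between 0 and 1 for how much of the reference document survives. The function names, the score values, and the 10% threshold are hypothetical choices for illustration, not the benchmark's actual metric or data.

```python
from typing import List


def degradation(scores: List[float]) -> float:
    """Fraction of content lost by the final round (1.0 minus the last score)."""
    return 1.0 - scores[-1]


def catastrophic_rounds(scores: List[float], threshold: float = 0.10) -> List[int]:
    """Return 1-based round numbers whose score fell by at least `threshold`
    compared with the previous round. scores[0] is the score after round 1."""
    flagged = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop >= threshold:
            flagged.append(i + 1)
    return flagged


# Hypothetical scores: near-perfect for four rounds, then a sudden drop.
scores = [1.00, 0.99, 0.99, 0.98, 0.78, 0.77]

print(catastrophic_rounds(scores))    # rounds with a large single-step drop
print(round(degradation(scores), 2))  # overall loss by the end of the workflow
```

In this made-up series, spot checks during the first four rounds would suggest the workflow is safe, which is why a per-round check may be more informative than a single end-of-workflow score.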
Qualitative Differences in Failure Modes
The researchers identified a critical distinction in how different tiers of models fail:
- Frontier Models: These models tend to corrupt content while preserving the document's structural integrity. This makes errors "harder to detect," as the document looks correct at a glance despite containing inaccurate or altered information.
- Weaker Models: These models are more prone to deleting content entirely. While this results in lower overall quality, the errors are "obvious to spot," making them easier for human users to identify and correct.
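The difference between the two failure modes can be sketched with a line-level diff. This is a minimal illustration using Python's standard `difflib` module, not the researchers' evaluation method. The function name `classify_changes` and the sample documents are hypothetical.

```python
import difflib


def classify_changes(reference: str, edited: str) -> dict:
    """Count reference lines that were left unchanged, deleted, or altered in place."""
    ref_lines = reference.splitlines()
    out_lines = edited.splitlines()
    matcher = difflib.SequenceMatcher(a=ref_lines, b=out_lines, autojunk=False)
    counts = {"unchanged": 0, "deleted": 0, "altered": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["unchanged"] += i2 - i1
        elif tag == "delete":
            counts["deleted"] += i2 - i1
        elif tag == "replace":
            counts["altered"] += i2 - i1
        # "insert" opcodes add lines absent from the reference; ignored here.
    return counts


# Hypothetical documents for illustration only.
reference = "Revenue: 120\nCosts: 80\nProfit: 40"
structure_preserving = "Revenue: 120\nCosts: 85\nProfit: 40"  # value altered, layout intact
content_deleting = "Revenue: 120\nProfit: 40"                 # a line is missing

print(classify_changes(reference, structure_preserving))
print(classify_changes(reference, content_deleting))
```

A deleted line changes the document's length and shape, so a reader is likely to notice it. An altered value leaves the layout untouched, so it tends to surface only when the output is compared against the reference.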
Implications for AI Delegation
The study presents a significant challenge to the current trajectory of AI integration in professional environments. While companies are actively selling "delegation" as a feature, the data suggests that LLMs are not yet reliable enough to handle multi-turn, complex document editing without human oversight. The researchers have made the Delegate 52 benchmark available via a GitHub repository, allowing for independent verification of these findings.
Conclusion
The core takeaway is that current LLMs suffer from a "reliability cliff" in long-horizon workflows. The tendency for frontier models to mask corruption within structurally sound documents poses a significant risk for professional applications where accuracy is paramount. Until these models can mitigate catastrophic single-round failures, delegating complex document editing to AI remains a high-risk activity that requires rigorous human verification.