Claude, GPT-5, Gemini 3: Only 1 of 52 Jobs They Can Actually Do

By The AI Automators

Key Concepts

Delegate 52 Benchmark: A testing framework designed to evaluate LLM performance on long-horizon, multi-step document editing tasks.
Long-Horizon Delegated Workflows: Complex, multi-turn tasks where an AI acts as an agent to modify professional documents over time.
Catastrophic Single-Round Failure: A phenomenon where an LLM maintains high accuracy for several turns before suddenly causing significant document corruption in one step.
Content Degradation: The loss or alteration of information within a document, which varies in detectability based on the model's sophistication.

The Delegate 52 Benchmark and Methodology

Microsoft researchers have introduced the Delegate 52 benchmark, a rigorous evaluation tool designed to test how Large Language Models (LLMs) handle professional document editing. Unlike single-shot tasks, this benchmark simulates real-world "long-horizon" workflows across diverse sectors, including robotics, accounting, aviation, translation, and presentation design.

Scope: The benchmark consists of 310 distinct environments, each containing real-world documents and 5 to 10 complex editing instructions.
Evaluation: Researchers tested 19 LLMs across six different model families to determine their reliability when acting as autonomous agents for knowledge workers.

Research Findings and Performance Metrics

The study reveals a significant gap between the commercial promise of AI-driven delegation (as seen in Microsoft's Copilot suite) and the actual technical reliability of current LLMs.

Degradation Rates: Even "frontier" models (e.g., Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt an average of 25% of document content over the course of long workflows. Across all 19 models tested, the average degradation rate reached 50%.
The "Catastrophic Failure" Pattern: The research highlights a deceptive pattern in model behavior. Models often maintain near-perfect reconstruction for several rounds, creating a false sense of security. This is followed by sudden, catastrophic failures where 10% to 30% of the document content is corrupted in a single turn.
Domain Sensitivity: Performance is highly dependent on the domain:
- High Performance: Programmatic domains, such as Python code or database management.
- Low Performance: Natural language tasks and niche domains, such as financial earnings statements or music notation.
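
The "Catastrophic Failure" pattern described above can be illustrated with a small monitoring sketch. This is not the benchmark's scoring method: the `similarity` helper (a character-level ratio from Python's `difflib`), the `find_cliffs` function, the 10% threshold, and the sample scores are all assumptions chosen for illustration. The idea is to score the document against a reference after every round and flag any single round where the score drops sharply.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Rough 0..1 similarity between two document versions (illustrative metric)."""
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


def find_cliffs(scores: list[float], drop_threshold: float = 0.10) -> list[tuple[int, float]]:
    """Return (round_index, drop) for every round whose score fell by at least
    drop_threshold relative to the previous round."""
    cliffs = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop >= drop_threshold:
            cliffs.append((i, drop))
    return cliffs


# Hypothetical per-round scores: near-perfect for several rounds, then a sudden drop.
scores = [1.0, 0.99, 0.99, 0.98, 0.75, 0.74]
cliffs = find_cliffs(scores)
for round_index, drop in cliffs:
    print(f"round {round_index}: score fell by {drop:.0%}")
```

A per-round check like this matters because an average taken over the whole workflow would hide the shape of the failure: the early rounds look healthy, and most of the damage arrives in one step.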

Qualitative Differences in Failure Modes

The researchers identified a critical distinction in how different tiers of models fail:

Frontier Models: These models tend to corrupt content while preserving the document's structural integrity. This makes errors "harder to detect," as the document looks correct at a glance despite containing inaccurate or altered information.
Weaker Models: These models are more prone to deleting content entirely. While this results in lower overall quality, the errors are "obvious to spot," making them easier for human users to identify and correct.
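
To make the distinction between the two failure modes concrete, here is a rough heuristic sketch, assuming a line-level diff is a reasonable proxy. The `classify_failure` function, its labels, and the sample documents are hypothetical and do not come from the study. It counts reference lines that were removed outright versus lines that were replaced in place.

```python
from difflib import SequenceMatcher


def classify_failure(reference: str, edited: str) -> str:
    """Label an edit as 'intact', 'deletion', or 'in-place corruption'
    based on a line-level diff against the reference (illustrative heuristic)."""
    ref_lines = reference.splitlines()
    new_lines = edited.splitlines()
    matcher = SequenceMatcher(None, ref_lines, new_lines, autojunk=False)

    deleted = 0   # reference lines missing from the edited document
    replaced = 0  # reference lines still present in position but altered
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            deleted += i2 - i1
        elif tag == "replace":
            replaced += i2 - i1

    if deleted == 0 and replaced == 0:
        return "intact"
    return "deletion" if deleted > replaced else "in-place corruption"


reference = "Revenue: 100\nCosts: 40\nProfit: 60"
corrupted = "Revenue: 100\nCosts: 45\nProfit: 60"  # same shape, one wrong value
truncated = "Revenue: 100"                          # content removed outright

print(classify_failure(reference, corrupted))
print(classify_failure(reference, truncated))
```

The corrupted example keeps the same number of lines and the same layout, so a length or structure check alone would pass it. Only a content-level comparison against a trusted reference reveals the changed value.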

Implications for AI Delegation

The study presents a significant challenge to the current trajectory of AI integration in professional environments. While companies are actively selling "delegation" as a feature, the data suggests that LLMs are not yet reliable enough to handle multi-turn, complex document editing without human oversight. The researchers have made the Delegate 52 benchmark available via a GitHub repository, allowing for independent verification of these findings.

Conclusion

The core takeaway is that current LLMs suffer from a "reliability cliff" in long-horizon workflows. The tendency for frontier models to mask corruption within structurally sound documents poses a significant risk for professional applications where accuracy is paramount. Until these models can mitigate catastrophic single-round failures, delegating complex document editing to AI remains a high-risk activity that requires rigorous human verification.
