GPT-5.2 vs 5.1 Agents: Real Work Test

GPT-5.2 vs. GPT-5.1: A Head-to-Head Agent Building Test

Key Concepts:

GPT-5.2: OpenAI’s latest language model, positioned as the most capable for professional knowledge work.
Agents: AI systems designed to perform specific tasks autonomously, often utilizing tools and interacting with environments.
GDPval (the "GDP Value" benchmark): OpenAI's benchmark measuring model performance on economically valuable knowledge-work tasks.
SWE-bench: A family of benchmarks evaluating model performance on software engineering tasks.
Long Context Window: The ability of a model to process and understand large amounts of text (up to 256,000 tokens in GPT-5.2).
Tool Calling: The model’s ability to utilize external tools to enhance its capabilities.
Git Worktrees Integration (Cursor): A Cursor feature that uses Git worktrees to run multiple models in parallel on the same codebase.
One-shotting: Completing a task from a single prompt and a single round of code execution, with no follow-up corrections.

I. GPT-5.2 Announcement Highlights & Benchmarks

The video focuses on evaluating whether GPT-5.2 represents a substantial improvement over GPT-5.1, particularly for agent building and real-world applications, rather than just being an incremental update. The analysis centers on key benchmarks outlined in OpenAI’s announcement.

GDPval Benchmark: With thinking enabled, GPT-5.2 nearly doubles GPT-5's score and surpasses human-expert-level performance. This benchmark focuses on economically valuable tasks, unlike benchmarks such as ARC-AGI, which the video considers less indicative of practical utility. The model also generates significantly better spreadsheets and slides.
SWE-bench: GPT-5.2 shows a significant improvement in software engineering tasks, even outperforming the recently released GPT-5.1-Codex-Max model. It also generates better UIs.
Long Context Benchmark: GPT-5.2 achieves near 100% accuracy on the four-needle-in-a-haystack test with up to 256,000 tokens, demonstrating its ability to handle extremely large context windows for tasks requiring deep analysis.
Vision Data Sets: Improvements in vision data processing unlock new use cases for agent builders.
Tool Calling Benchmark (Tau2-bench): GPT-5.2 shows a 3 percentage point improvement (95% to 98%) in tool usage. The video describes this as a greater than 50% improvement, because gains are harder to achieve at high accuracy levels: the failure rate falls from 5% to 2%. The model also orchestrates multiple tools and agents better, performing more tool calls and handoffs (e.g., three handoffs to a compensation agent versus GPT-5.1’s single handoff).
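One way to read the "greater than 50%" claim is as a relative reduction in failures rather than a gain in accuracy. The sketch below works through that arithmetic with the rounded 95% and 98% figures from the summary. The function name is illustrative, and the error-rate interpretation is an assumption about what the video means.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction of previous failures eliminated by the accuracy gain."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    if old_error <= 0:
        raise ValueError("old accuracy leaves no failures to remove")
    return (old_error - new_error) / old_error


# Rounded figures from the summary: 95% -> 98% on the tool calling benchmark.
reduction = relative_error_reduction(0.95, 0.98)
print(f"Failures removed: {reduction:.0%}")  # prints "Failures removed: 60%"
```

Under this reading, a 3 point gain near the top of the scale removes more than half of the remaining failures, which is consistent with the "greater than 50%" framing.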

Cost: GPT-5.2 costs 1.4x as much as GPT-5.1. GPT-5.2 Pro is far more expensive at 12x the price of standard GPT-5.2: $168 per 1 million output tokens.
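The per-model output prices implied by these ratios can be worked out as follows. This is a minimal sketch: only the $168 figure and the two multipliers come from the summary, and it assumes the 1.4x ratio applies to output tokens. The derived prices and the `output_cost` helper are illustrative, not quoted from a price list.

```python
from decimal import Decimal

# Figures stated in the summary.
PRO_OUTPUT_PRICE = Decimal("168")   # USD per 1M output tokens (GPT-5.2 Pro)
PRO_MULTIPLIER = Decimal("12")      # Pro price relative to standard GPT-5.2
GPT52_VS_GPT51 = Decimal("1.4")     # GPT-5.2 price relative to GPT-5.1

# Derived values (assumption: both ratios apply to output-token pricing).
gpt52_output_price = PRO_OUTPUT_PRICE / PRO_MULTIPLIER
gpt51_output_price = gpt52_output_price / GPT52_VS_GPT51


def output_cost(tokens: int, price_per_million: Decimal) -> Decimal:
    """Return the USD cost of generating `tokens` output tokens."""
    return Decimal(tokens) * price_per_million / Decimal(1_000_000)


print(f"GPT-5.2 (derived): ${gpt52_output_price} per 1M output tokens")
print(f"GPT-5.1 (derived): ${gpt51_output_price} per 1M output tokens")
print(f"50k output tokens on GPT-5.2 Pro: ${output_cost(50_000, PRO_OUTPUT_PRICE)}")
```

For an agent that makes many tool calls, output tokens accumulate quickly, so the multiplier matters more than the per-token figure suggests.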

II. Agent Building & Comparison: Methodology

The video tests GPT-5.1 and GPT-5.2 by building three agents: a deck (presentation) agent, a repo (coding) agent, and a spreadsheet agent. The testing uses the agency starter template and Cursor's Git worktrees integration to run both models in parallel on the same codebase. Each model is first given a broad instruction to build agents for real-world tasks. The agents are later refined with more specific prompts and code execution capabilities.

III. Initial Agent Creation Results

GPT-5.1: Demonstrates a slower initial build speed. The resulting code structure is flawed, with all tools placed in a single file (contrary to the framework’s instructions). Crucially, it lacks a tool to modify spreadsheets, only offering a tool to load them. The agent struggles to complete tasks without the necessary tools.
GPT-5.2: Builds agents faster and with a correct code structure, placing each tool in a separate file. It includes all necessary tools for the coding agent (read, replace, write files, run shell commands). It demonstrates a better understanding of the prompt and framework guidelines.
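To make the coding agent's tool set concrete, here is a minimal sketch of the four capabilities listed above (read, replace, write files, run shell commands) as plain Python functions. The names and signatures are hypothetical and are not the framework's API or the code either model produced. They are shown in one block for brevity, whereas the framework's convention described above is one tool per file.

```python
import subprocess
from pathlib import Path


def read_file(path: str) -> str:
    """Return the full text of a file."""
    return Path(path).read_text(encoding="utf-8")


def write_file(path: str, content: str) -> str:
    """Create or overwrite a file, creating parent directories as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"wrote {len(content)} characters to {path}"


def replace_in_file(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old`; refuse ambiguous edits."""
    text = read_file(path)
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 match, found {count}")
    write_file(path, text.replace(old, new, 1))
    return f"replaced 1 occurrence in {path}"


def run_shell(command: str, cwd: str = ".", timeout_s: float = 60.0) -> str:
    """Run a shell command and return its exit code plus combined output."""
    proc = subprocess.run(
        command,
        shell=True,
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"
```

Each function returns a short string so a model can read the outcome of its own call. A missing tool, such as the absent spreadsheet-editing tool in GPT-5.1's build, leaves the agent with no way to act at all.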

IV. Task Performance: Wave Simulator & Initial Presentations

Wave Simulator (Coding Agent): GPT-5.1 generates a simplified version of the wave simulator, while GPT-5.2 produces a more visually impressive 3D visualization, closely resembling the example shown in OpenAI’s blog post.
Initial Presentations (Deck Agent): Both models create presentations, but both results are limited. The video attributes this to the presentation tool itself, which restricts detailed formatting and graph creation.

V. Enhancing Agents with Code Execution

The key to significant improvement is enabling agents to run code. This allows them to leverage libraries and functionalities beyond the model’s inherent capabilities.

Presentation Agent (Both Models): After code execution is enabled, GPT-5.2 generates a substantial amount of code using the IPython tool and successfully creates a presentation. GPT-5.1 also generates code and completes the task, but its results are significantly less sophisticated.
Spreadsheet Agent (Both Models): GPT-5.2 creates a comprehensive dashboard with multiple sheets, well-formatted data, headings, and usage instructions. GPT-5.1’s output is comparatively basic.
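The idea of letting an agent run its own code can be sketched as a single tool that executes a generated script in a fresh Python process. This is an illustrative stand-in, not the IPython tool used in the video. `run_python`, `ExecutionResult`, and the sample script are hypothetical, and a real deployment would need sandboxing that this sketch does not provide.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecutionResult:
    returncode: int
    stdout: str
    stderr: str


def run_python(code: str, workdir: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Write `code` to a script inside `workdir` and run it in a new interpreter."""
    script = Path(workdir) / "agent_snippet.py"
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, script.name],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ExecutionResult(-1, "", f"timed out after {timeout_s} seconds")
    return ExecutionResult(proc.returncode, proc.stdout, proc.stderr)


# Stand-in for model-generated code: writes a small CSV report with the stdlib.
GENERATED_CODE = '''
import csv
rows = [("region", "revenue"), ("north", 120), ("south", 95)]
with open("report.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
print("wrote", len(rows) - 1, "data rows")
'''

with tempfile.TemporaryDirectory() as workdir:
    result = run_python(GENERATED_CODE, workdir)
    report_exists = (Path(workdir) / "report.csv").exists()

print(result.stdout.strip(), "| report created:", report_exists)
```

The design point is that one general tool replaces many narrow ones: instead of a fixed "add chart" or "format cell" tool, the model writes whatever library calls the task needs, and the tool returns stdout and stderr so the model can correct its own mistakes.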

VI. Side-by-Side Comparison with Enhanced Agents

Using the same prompts and tools, a direct comparison reveals substantial differences:

Spreadsheet Agent: GPT-5.2’s output is far superior, creating a dashboard with multiple sheets, proper formatting, headings, and instructions.
Presentation Agent: GPT-5.2 generates more usable presentations with charts and cards, while GPT-5.1 produces basic slides with limited layouts.

Notable Quote: “So, as you can see, again, the new paradigm is to just let your agents run the code. It's just so much easier to build agents when you just give them the autonomy that they need to complete their tasks accordingly.” – The video creator emphasizes the importance of granting agents code execution capabilities.

VII. Conclusion & Takeaways

The video concludes that GPT-5.2 represents a significant improvement over GPT-5.1, particularly for agent building and professional knowledge work. The key takeaways are:

GPT-5.2 is demonstrably more capable: It excels in agent creation, code generation, spreadsheet manipulation, and presentation design.
Code execution is crucial: Enabling agents to run code unlocks significantly greater potential and allows them to leverage external libraries and functionalities.
GPT-5.2 understands complex tasks: It demonstrates an ability to orchestrate multiple tools and agents effectively.
The cost is a factor: GPT-5.2 is more expensive than GPT-5.1, and GPT-5.2 Pro is significantly more costly.
Future Potential: The advancements in agent capabilities suggest a future where minimal human intervention may be required for many tasks.

The video strongly supports OpenAI’s claims about GPT-5.2 being the best model for knowledge work, highlighting its improvements in areas that matter most for real-world applications and service-based industries.