GPT-5.2 (V/S Gemini 3 & Opus 4.5) - Fully Tested: Is it the OPENAI Comeback or A FLOP?

GPT-5.2: A Detailed Analysis of Benchmarks and Performance

Key Concepts:

GPT-5.2: OpenAI’s latest model, positioned as a successor to GPT-5.1, with increased pricing.
Reasoning Tokens: Tokens used by the model during complex problem-solving, impacting cost.
Agentic Benchmarks: Testing models within an agentic framework, simulating real-world application development.
Non-Agentic Benchmarks: Evaluating models on isolated tasks without agentic orchestration.
Verdant & Kilo Code: Development platforms used for agentic testing, offering different features for managing and running agents.
Ultraink: A variant enhancing model capabilities through extended runtime and reasoning.
Hallucination: The tendency of a model to generate factually incorrect or nonsensical information.
Opus, Gemini, Sonnet: Competing large language models used for comparison.

I. Model Overview & Pricing

GPT-5.2 is presented as an upgrade to GPT-5.1, but with a notable price increase to $14 per million output tokens, matching the cost of the Sonnet model. This pricing is higher than Gemini and raises concerns about value. OpenAI claims potential cost savings in everyday usage due to improved reasoning token efficiency, but an “extra high” variant adds further complexity. A separate GPT-5.2 Pro model is now available via the API. The speaker expresses skepticism about the contradicting pricing and variant structure.

II. OpenAI’s Internal Benchmarks

OpenAI conducted several internal benchmarks to assess GPT-5.2’s capabilities:

OpenAI PRS (Pull Request Simulation): GPT-5.2 outperformed GPT-5.1 Codeex Max in handling real internal pull requests, including code modification and unit test passing.
MLE Bench (Kaggle Simulation): GPT-5.2 achieved the highest performance on this leaderboard, simulating Kaggle data science competitions with a GPU and 24-hour time limit.
OpenAI Proof Q&A: Surprisingly, GPT-5.2 scored lower than the previous Codeex model on this benchmark, which tests the ability to diagnose complex engineering bottlenecks. This suggests a potential trade-off: improved code generation but potentially reduced debugging skills.

A crucial finding regarding instruction following is that GPT-5.2 prioritizes strict output constraints (e.g., outputting only an integer) over honesty, increasing the likelihood of hallucination when it lacks a correct answer.

III. Non-Agentic Benchmark Results

The speaker conducted independent non-agentic benchmarks, evaluating GPT-5.2’s performance on various tasks:

Floor Plan Generation: Initial attempts failed, producing aesthetically dark maps lacking doors. The reasoning variant improved, generating a map with doors, but with illogical structural design (e.g., a disproportionately small living room).
SVG Panda Eating a Burger: The initial generation was “wonky” and poorly proportioned.
Pokeball in 3JS: Generated a satisfactory result, adhering to the prompt without unnecessary additions.
Chessboard with Autoplay: Generally fine, but not exceptional.
Minecraft in 3JS: Considered “atrocious” and of very poor quality.
Majestic Butterfly Flying in a Garden: A strong result, with good physics and animation, though leaning towards a dark aesthetic.
CLI Tool in Rust & Blender Script for Pokeball: Both failed to produce working code.
Riddle Solving: Failed to solve a riddle that even smaller models could handle.

The speaker notes that the non-reasoning variant of GPT-5.2 falls behind GPT-5.1, and overall, the model doesn’t compare favorably to Gemini 3 or Opus.

IV. Agentic Benchmark Results & Comparison

Agentic benchmarks were conducted using Verdant (for Opus 4.5 and Gemini 3 Pro) and Kilo Code (for GPT-5.2), testing the models’ ability to build applications:

Testing Methodology: Seven questions were used, with Verdant leveraging its “workspaces” (similar to Git work trees) for parallel testing and Kilo Code providing good reasoning support.
Movie Tracker App: GPT-5.2 produced a functional app with a GPT-5-like design, but Opus 4.5 generated a significantly more polished and professional-looking application. Sonnet 4.5 also delivered a solid result.
UI Calculator App: Opus again excelled, while Sonnet was somewhat flawed. Gemini produced a unique aesthetic. GPT-5.2 overengineered the solution, creating five files and a test suite, even for a simple task, and initially contained an error.
Spelt App: Opus demonstrated exceptional performance, completing the task in 20 minutes with full functionality (authentication, database integration, board creation). Sonnet matched Opus’s performance. Gemini 3 produced a dated-looking application. GPT-5.2 delivered a functional but unremarkable result.
Tari App: Gemini 3 failed completely. Opus worked, while Sonnet and GPT-5.2 failed.
Open Code & Nux App: Opus passed all tests, while Sonnet, Gemini, and GPT-5.2 failed the Open Code test.
Godo Game: All models passed.

V. Leaderboard & Overall Assessment

GPT-5.2 achieved eighth place on the leaderboard. However, given its comparable price to Sonnet 4.5, the speaker suggests that Sonnet, especially when used with Verdant and Ultraink, offers a better price-to-performance ratio. Opus 4.5 remains the superior choice for those who can afford it.

The speaker summarizes GPT-5.2 as “Gemini 3 by OpenAI,” noting its strong performance in simple tasks but its rapid decline in complexity when used in agentic workflows. They state they will continue using Gemini 3 and Opus 4.5 for their front-end and back-end development needs, respectively.

Quote: “If I had to summarize this model in one line, then I’d just say that it’s Gemini 3 by OpenAI.” - The speaker.

Conclusion:

GPT-5.2 demonstrates improvements in certain areas, particularly code generation, but exhibits weaknesses in debugging, complex reasoning, and maintaining output integrity under strict constraints. Its higher price point, coupled with its performance relative to competitors like Gemini 3 and Opus, makes it a less compelling option for many developers. The model appears best suited for straightforward tasks where its generation capabilities can shine, but requires careful supervision in more demanding agentic applications. The speaker’s benchmarks highlight the importance of considering the entire workflow, not just isolated model performance, when selecting a large language model.