🚀 GPT-4.1 Is Great at Coding, But I Won’t Use It. Here’s Why!

By Prompt Engineering


Key Concepts

GPT-4.1, coding benchmarks (HumanEval, Polyglot), token cost, Gemini 2.5 Pro, DeepSeek V3, agentic capabilities, tool usage, code generation, code modification, single-shot code generation, Model Context Protocol (MCP), Agent to Agent Protocol, hallucination, P5.js, HTML, CSS, JavaScript, web search tool.

GPT-4.1 Coding Capabilities: An Evaluation

This video evaluates the coding capabilities of GPT-4.1 and compares it to other models such as GPT-4o, Gemini 2.5 Pro, and DeepSeek V3. The speaker runs the coding tests in the OpenAI playground, focusing on creative freedom, instruction following, and tool usage.

Preliminary Coding Tests

  1. Landing Page Generation: GPT-4.1 was prompted to create a modern landing page using HTML, CSS, and JS in a single file. The model generated the code in 33 seconds, producing approximately 3,000 tokens. The resulting website was deemed "pretty decent looking" with a hero section and a contact form.
  2. Interactive Website with Jokes: The model was tasked with creating a website featuring a button that, when clicked, displays a random joke, changes the background color, and adds an animation. The model implemented all three features successfully (a minimal sketch of this pattern appears after this list).
  3. Encyclopedia of Pokémon: The model was asked to create a simple encyclopedia of the first 25 legendary Pokémon, including their types, short descriptions, and images, all within a single file using HTML, CSS, and JS. While the model successfully generated a list of Pokémon with descriptions, the initial image URLs were non-functional. When prompted to use a web search tool to find working image links, the model generated updated code with functional links.
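
For reference, here is a minimal sketch of the joke-button pattern from the second test. It assumes an HTML page that already contains a `<button id="joke-btn">`, a `<p id="joke-text">`, and a `.pop` CSS keyframe animation; the element ids, jokes, and colors are illustrative and not taken from the video.

```javascript
// Minimal sketch of the "random joke button" page logic.
// Assumes <button id="joke-btn"> and <p id="joke-text"> exist in the HTML,
// plus a .pop animation defined in the CSS. All content here is illustrative.
const jokes = [
  "Why do programmers prefer dark mode? Because light attracts bugs.",
  "I told my computer I needed a break. It said: 'No problem, going to sleep.'",
  "There are 10 kinds of people: those who understand binary and those who don't.",
];
const colors = ["#ffe4e1", "#e0ffff", "#f0fff0", "#fff8dc"];

document.getElementById("joke-btn").addEventListener("click", () => {
  const jokeEl = document.getElementById("joke-text");

  // Show a random joke.
  jokeEl.textContent = jokes[Math.floor(Math.random() * jokes.length)];

  // Change the page background to a random color.
  document.body.style.backgroundColor =
    colors[Math.floor(Math.random() * colors.length)];

  // Retrigger the CSS animation on every click.
  jokeEl.classList.remove("pop");
  void jokeEl.offsetWidth; // force reflow so the animation can restart
  jokeEl.classList.add("pop");
});
```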

Tool Usage and Agentic Capabilities

  1. Model Context Protocol (MCP) and Agent to Agent Protocol: The model was asked to define the "Model Context Protocol" (MCP) and the "Agent to Agent Protocol" and to explain their differences. Without being explicitly told to use the web search tool, the model hallucinated: it confidently provided incorrect definitions and even fabricated a comparison table. When prompted to use the web search tool, it gave a more accurate definition of MCP, correctly attributing it to Anthropic, but still struggled to find information on the Agent to Agent Protocol. This points to a potential issue with the model's ability to use tools effectively, even when they are available.
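
As a rough illustration of making tool use explicit rather than leaving it to the model, here is a hypothetical sketch assuming the official `openai` Node SDK and its Responses API with the built-in `web_search_preview` tool type; the model name and prompt are illustrative, and this is not code shown in the video.

```javascript
// Hypothetical sketch: explicitly handing the model a web search tool.
// Assumes the official "openai" Node SDK and its Responses API with the
// "web_search_preview" tool type; run as an ES module for top-level await.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const response = await client.responses.create({
  model: "gpt-4.1",
  tools: [{ type: "web_search_preview" }], // make the search tool available
  input:
    "Define the Model Context Protocol (MCP) and the Agent to Agent Protocol, " +
    "and explain how they differ. Cite the sources you used.",
});

console.log(response.output_text); // answer text, ideally grounded in search results
```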

Complex Coding Challenges

  1. TV Channel Animation: The model was instructed to code a TV channel interface with number keys from 0 to 9, each representing a different channel inspired by classic TV genres. Each channel was to display interesting animations and a creative name within a square box using P5.js, all masked to a TV set area and contained in a single file. The resulting animations were considered high quality and creative, although the sketch was not perfectly square.
  2. Falling Letters Animation: The model was tasked with creating a JavaScript animation of falling letters with realistic physics: letters appear at random positions at the top of the screen with varying sizes, fall under Earth-like gravity, and collide, based on their size and shape, with each other, the ground, and the screen boundaries. The generated code produced the desired animation, with letters falling, settling on the ground, and colliding with one another (a simplified sketch of this setup follows this list).
  3. Bouncing Balls in a Heptagon: The model was challenged to create an HTML program displaying 20 numbered balls bouncing inside a spinning heptagon, subject to gravity and friction, with realistic collisions against each other and the heptagon's sides. Despite specific instructions, including required colors, the model failed to produce the desired behavior: the balls lined up and moved in unison, without proper collision detection.
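
As a point of reference for the falling-letters test, here is a simplified p5.js sketch of the same setup: letters spawn at the top, accelerate under a constant gravity value, bounce off the ground and walls, and push apart when their circle-approximated bounds overlap. The constants and collision model are simplifications chosen for illustration, not the code GPT-4.1 produced.

```javascript
// Simplified p5.js sketch of the falling-letters test (illustrative constants,
// circle-approximated collisions; not the code generated in the video).
const GRAVITY = 0.5;       // constant downward acceleration, pixels/frame^2
const RESTITUTION = 0.6;   // fraction of speed kept after a bounce
const letters = [];

function setup() {
  createCanvas(600, 400);
  textAlign(CENTER, CENTER);
}

function draw() {
  background(30);

  // Spawn a new random letter every 20 frames, up to a cap.
  if (frameCount % 20 === 0 && letters.length < 40) {
    letters.push({
      ch: String.fromCharCode(65 + floor(random(26))), // A-Z
      x: random(30, width - 30),
      y: -20,
      vx: random(-1, 1),
      vy: 0,
      size: random(18, 48), // font size; also sets the collision radius
    });
  }

  // Integrate gravity, then bounce off the ground and side walls.
  for (const l of letters) {
    l.vy += GRAVITY;
    l.x += l.vx;
    l.y += l.vy;

    const r = l.size / 2;
    if (l.y + r > height) { l.y = height - r; l.vy *= -RESTITUTION; }
    if (l.x - r < 0)      { l.x = r;          l.vx *= -RESTITUTION; }
    if (l.x + r > width)  { l.x = width - r;  l.vx *= -RESTITUTION; }
  }

  // Letter-vs-letter collisions: separate overlapping pairs and swap velocities.
  for (let i = 0; i < letters.length; i++) {
    for (let j = i + 1; j < letters.length; j++) {
      const a = letters[i], b = letters[j];
      const dx = b.x - a.x, dy = b.y - a.y;
      const dist = sqrt(dx * dx + dy * dy);
      const minDist = (a.size + b.size) / 2;
      if (dist > 0 && dist < minDist) {
        const push = (minDist - dist) / 2;
        const nx = dx / dist, ny = dy / dist;
        a.x -= nx * push; a.y -= ny * push;
        b.x += nx * push; b.y += ny * push;
        [a.vx, b.vx] = [b.vx, a.vx]; // crude elastic exchange
        [a.vy, b.vy] = [b.vy, a.vy];
      }
    }
  }

  fill(255);
  for (const l of letters) {
    textSize(l.size);
    text(l.ch, l.x, l.y);
  }
}
```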

Performance and Cost Analysis

The speaker argues that despite GPT-4.1's impressive coding capabilities, it may not be the best choice for daily use due to its cost and performance compared to other models.

  • Benchmarks: GPT-4.1 achieves a coding success rate of approximately 52% on the HumanEval/Polyglot coding benchmark.
  • Cost: Running the full benchmark with GPT-4.1 costs approximately $9.86, roughly $10.
  • Comparison: Gemini 2.5 Pro, with a coding success rate of around 66% on the same benchmark, can be cheaper than GPT-4.1 as long as the output token count stays below 200,000. DeepSeek V3 and Gemini Flash are also presented as more cost-effective options for smaller tasks (a small worked cost example follows this list).
  • Pier Bongard's Perspective: The video references a tweet from Pier Bongard, an AI scientist at Harvard, suggesting that Gemini 2.5 Pro or DeepSeek R1 might be better alternatives.
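
To make the cost comparison concrete, here is a small JavaScript helper that prices a workload from per-million-token rates. The token counts and prices below are placeholders, not figures from the video or current published pricing; substitute real numbers to reproduce the kind of comparison the speaker makes.

```javascript
// Illustrative cost calculator: prices a workload from per-million-token rates.
// Token counts and prices are placeholders, NOT figures from the video or
// current published pricing; plug in real values before comparing models.
function runCost({ inputTokens, outputTokens, inputPricePerM, outputPricePerM }) {
  return (inputTokens / 1e6) * inputPricePerM + (outputTokens / 1e6) * outputPricePerM;
}

// The same hypothetical benchmark workload priced under two hypothetical rate cards.
const workload = { inputTokens: 3_000_000, outputTokens: 1_000_000 };

console.log("Model A: $" + runCost({ ...workload, inputPricePerM: 2.0, outputPricePerM: 8.0 }).toFixed(2));
console.log("Model B: $" + runCost({ ...workload, inputPricePerM: 1.25, outputPricePerM: 10.0 }).toFixed(2));
```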

Conclusion

While GPT-4.1 represents an upgrade over GPT-4o, its price point and benchmark performance make it less compelling than alternatives like Gemini 2.5 Pro and DeepSeek V3. The speaker acknowledges that benchmarks may not capture the full picture and plans to continue testing GPT-4.1 on agentic tasks, looking for specific use cases where it might excel. Viewers are encouraged to share their experiences and preferred models.
