6 LLMs Tested: GPT-5 vs. Sonnet 4.5 vs. Grok 4 and More

Key Concepts

AI Model Comparison: Evaluating the performance of different AI models (GPT-5, OpenAI O3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro) on real-world coding problems.
Real-World Coding Problems: Three specific scenarios tested: Node.js config merge vulnerability, modern agent workflow security, and ImageMagick command injection.
Test Harness: A consistent and repeatable setup used to evaluate the AI models.
Two-Phase Scoring: A methodology involving an AI judge with a rubric and human validation by engineers.
Rubric Criteria: Correctness, code quality, completeness, safety-minded practices, and performance.
Vulnerability Types: Prototype pollution, LLM prompt injection, and command injection.
Security Best Practices: Null prototypes, hasOwnProperty, allow lists, input sanitization, least privilege, trust boundaries, argument vectors, and rate limiting.
Cost-Benefit Analysis: Comparing the cost of using different AI models against their performance and the complexity of their solutions.
Pragmatic Recommendations: Matching AI model capabilities to specific use cases (mission-critical reviews vs. bulk scanning).

Walkthrough of AI Model Evaluation

The Kilo Code blog ran a comparative analysis of six AI models on three distinct real-world coding problems. The goal was to assess not only whether each model could identify the issue but also the quality, cost, and maintainability of its proposed fix.

1. Node.js Config Merge Problem

Problem Description: This scenario involved a deep merge function in Node.js that merged user input directly into a settings object. Downstream security checks relied on an admin flag. A crafted payload (for example, one carrying a __proto__ key) could pollute the shared object prototype, so the admin flag appeared on objects that never set it and privileges were elevated unexpectedly. This is analogous to classic OWASP (Open Web Application Security Project) patterns seen in older CVEs (Common Vulnerabilities and Exposures).
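
To make the attack concrete, here is a minimal sketch of how a naive recursive merge can be polluted. The function name unsafeMerge and the isAdmin flag are hypothetical stand-ins chosen for illustration; they are not the code used in the test.

```javascript
// Hypothetical vulnerable merge: copies every key from user input, recursively.
function unsafeMerge(target, source) {
  for (const key in source) {
    const value = source[key];
    if (value !== null && typeof value === "object") {
      // For the key "__proto__", target[key] resolves to Object.prototype,
      // so the recursive call below writes onto the shared prototype.
      if (!target[key]) target[key] = {};
      unsafeMerge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// JSON.parse creates "__proto__" as an ordinary own property of the payload.
const payload = JSON.parse('{"__proto__": {"isAdmin": true}}');
unsafeMerge({}, payload);

// An unrelated, freshly created object now appears to carry the admin flag.
const unrelatedUser = {};
const leakedFlag = unrelatedUser.isAdmin; // inherited from Object.prototype

// Clean up so the demo does not leave the runtime polluted.
delete Object.prototype.isAdmin;
```

The settings object itself never needs an isAdmin key for the check to pass; the flag arrives through inheritance, which is why guards such as hasOwnProperty and null prototypes matter.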

Model Performance: All six models successfully identified the vulnerability. However, the quality and deployability of their fixes varied significantly:

GPT-5: Implemented a multi-layered approach including safe base objects with null prototypes, explicit blocking of risky keys, hasOwnProperty guards during merges, and freezing sensitive logic to prevent side effects.
OpenAI O3: Provided clean helper functions, a clear list of problematic keys, hasOwnProperty checks, and well-commented code for easier review.
Claude Sonnet 4.5: Employed multi-layer tightening with Object.create(null) and key blocking.
Gemini 2.5 Pro: Focused on key filtering and null prototypes but missed some recursive edge cases.
Claude Opus 4.1: Leveraged schemas and type checks, which were robust but potentially heavier to maintain.
Grok 4: Primarily focused on filtering and omitted hasOwnProperty validation on the administrative path, which was noted as a drawback.

Key Takeaway: Every model made the "good catch," but only some delivered a "production-ready fix." The gap between the two is what separated the models, highlighting the need for comprehensive and robust solutions.
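
The layered defenses described above (null prototypes, blocked keys, hasOwnProperty guards, freezing) can be sketched roughly as follows. This is an illustrative combination under stated assumptions, not any model's actual output; the names BLOCKED_KEYS, safeMerge, and mergeConfig are hypothetical.

```javascript
// Keys that can reach or replace an object's prototype.
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Simplification: any non-null, non-array object is treated as mergeable.
function isMergeableObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function safeMerge(target, source) {
  // Object.keys returns own enumerable keys only, never inherited ones.
  for (const key of Object.keys(source)) {
    if (BLOCKED_KEYS.has(key)) continue; // drop risky keys at every depth
    const value = source[key];
    if (isMergeableObject(value)) {
      // Reuse the existing branch only if the target really owns it.
      const ownsBranch =
        Object.prototype.hasOwnProperty.call(target, key) &&
        isMergeableObject(target[key]);
      const branch = ownsBranch ? target[key] : Object.create(null);
      target[key] = safeMerge(branch, value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Build the result on a null-prototype base so nothing is inherited.
function mergeConfig(defaults, userInput) {
  const merged = safeMerge(Object.create(null), defaults);
  // Object.freeze is shallow: it protects only the top-level keys.
  return Object.freeze(safeMerge(merged, userInput));
}
```

A schema or an allow list of permitted setting names, as in the Claude Opus 4.1 answer, would go further by rejecting unknown keys altogether, at the price of more code to maintain.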

2. Modern Agent Workflow (2025 Style)

Problem Description: This test simulated a modern AI agent workflow in which an agent fetches web page content, interprets it, proposes tool calls to a cloud management API, and interacts with a WebAssembly (WASM) module that has file system access. The vulnerability arose when a web page contained hidden instructions that the model treated as guidance, leading it to propose cloud calls with malicious parameters. This could result in cross-tenant changes and token exposure through the runtime, aligning with the OWASP LLM01, LLM06, and LLM08 categories.

Model Performance:

GPT-5: Delivered an exceptionally strong solution featuring narrow tool scopes, output gating with a two-person confirmation rule, strict trust boundaries (preventing credentials from entering model text), sanitization and provenance checks on fetched HTML, and least privilege tokens (role-based, resource-scoped, and short-lived).
OpenAI O3: Was nearly as effective, providing a detailed analysis and even identifying "shadow tenant" style RBAC (Role-Based Access Control) scenarios. It included response schema validation and secure WASM configurations that disabled file system access.
Claude Sonnet 4.5: Proposed the correct theoretical approach (trust boundaries, provenance tracking, gating) but lacked depth in implementation.
Gemini 2.5 Pro: Scoped tools and used schema checks, but its gating mechanisms were considered less robust.
Claude Opus 4.1: Used Zod and DOMPurify and even diagrammed the flow, which was excellent for understanding, but its solution was lighter on layered security.
Grok 4: Referenced OWASP AI Top 10 and NIST, employing allow lists, but its gating logic was simpler.

Key Takeaway: For newer and more complex patterns, reasoning depth became more critical than simple pattern matching, with GPT-5 and OpenAI O3 demonstrating superior performance.
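
As a rough illustration of the gating ideas above (narrow tool scopes, tenant checks, and a two-person confirmation rule), the sketch below reviews a model-proposed tool call before anything runs. The tool names, the session shape, and the reviewToolCall function are all hypothetical; none of them come from the tested solutions.

```javascript
// Hypothetical allow list of tools the agent may call; names are illustrative.
const TOOL_POLICY = new Map([
  ["instances.list", { mutating: false }],
  ["instances.restart", { mutating: true }],
]);

// Decide whether a model-proposed tool call may run.
// `call` comes from the model and is untrusted; `session` comes from the server.
// Credentials are assumed to be attached by the executor after this gate,
// so they never appear in model-visible text.
function reviewToolCall(call, session) {
  const wellFormed =
    call !== null &&
    typeof call === "object" &&
    typeof call.tool === "string" &&
    call.args !== null &&
    typeof call.args === "object";
  if (!wellFormed) {
    return { allowed: false, reason: "malformed tool call" };
  }

  const policy = TOOL_POLICY.get(call.tool);
  if (!policy) {
    return { allowed: false, reason: "tool not on allow list" };
  }

  // The tenant is taken from the authenticated session, never from page content.
  if (call.args.tenantId !== session.tenantId) {
    return { allowed: false, reason: "cross-tenant target" };
  }

  // Mutating calls need approval from two distinct people.
  if (policy.mutating) {
    const approvers = new Set(session.approvals);
    if (approvers.size < 2) {
      return { allowed: false, reason: "needs two distinct approvals" };
    }
  }

  return { allowed: true, reason: "ok" };
}
```

The point of the design is that hidden instructions in a fetched page can change what the model proposes but not what the gate permits, because every decision is based on server-side state.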

3. ImageMagick Command Injection

Problem Description: This scenario involved an Express API that passed the font, size, and text directly into an ImageMagick command string. If a malicious font name containing a command such as rm -rf / was provided, the shell could execute it, leading to unintended system modifications. This mirrors the historical "ImageTragick" vulnerabilities.

Model Performance: All models identified the command injection vulnerability. The most effective solutions involved layered security measures:

GPT-5: Offered a comprehensive fix including strict allow lists, absolute font paths so that values cannot be read as special ImageMagick prefixes (e.g., MVG:, http:), bans on prefixes like inline: and caption@, a switch to spawn or execFile with argument vectors (avoiding shell interpretation), text piped via standard input, size and rate caps, and temporary file cleanup.
Claude Opus 4.1: Provided a thorough solution with spawn, allow lists, size range validation, control character filtering, rate limiting, explicit ImageMagick paths, and helpful demos for reviewers.
Claude Sonnet 4.5: Used execFile, strong allow lists, and rate limits.
OpenAI O3: Concisely switched to execFile with font validation and text sanitization.
Gemini 2.5 Pro: Employed spawn with allow lists and clean validation.
Grok 4: Explained shell parsing separators (;, |, &, `, $()), moved to spawn, and validated ranges.

Key Takeaway: The best fixes layered pure argument execution with strict allow lists and bans on problematic routes.
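
A minimal sketch of that layering in Node.js is shown below: an allow list maps font names to absolute paths, the size is range-checked, and the command runs through execFile with an argument vector so that no shell parses the input. The binary path, font path, size limits, and function names are assumptions for illustration. Reading the text from standard input via label:@- is standard ImageMagick syntax, but some installations restrict it through their security policy, so treat that part as an assumption as well.

```javascript
const { execFile } = require("node:child_process");

// Assumed locations; adjust to the actual installation.
const MAGICK_BIN = "/usr/bin/convert";
const FONT_ALLOW_LIST = new Map([
  ["dejavu-sans", "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"],
]);
const MIN_SIZE = 8;
const MAX_SIZE = 144;
const MAX_TEXT_LENGTH = 200;

// Build the argument vector from validated values only.
function buildRenderArgs({ font, size }) {
  const fontPath = FONT_ALLOW_LIST.get(font); // undefined unless allow-listed
  if (!fontPath) throw new Error("font not allowed");
  if (!Number.isInteger(size) || size < MIN_SIZE || size > MAX_SIZE) {
    throw new Error("size out of range");
  }
  // User text is not an argument: "label:@-" tells ImageMagick to read it from stdin.
  return ["-font", fontPath, "-pointsize", String(size), "label:@-", "png:-"];
}

// Reject oversized text and control characters before it reaches ImageMagick.
function validateText(text) {
  if (typeof text !== "string" || text.length === 0 || text.length > MAX_TEXT_LENGTH) {
    throw new Error("text length not allowed");
  }
  if (/[\u0000-\u001f\u007f]/.test(text)) {
    throw new Error("control characters not allowed");
  }
  return text;
}

// Run ImageMagick without a shell; callback receives (error, pngBuffer).
function renderText(request, callback) {
  const args = buildRenderArgs(request);
  const text = validateText(request.text);
  const child = execFile(
    MAGICK_BIN,
    args,
    { encoding: "buffer", timeout: 5000, maxBuffer: 5 * 1024 * 1024 },
    (error, stdout) => callback(error, stdout)
  );
  child.stdin.end(text);
}
```

Because execFile passes each array element as a separate argument, separators such as ; or | in a font name are never interpreted; with the allow list in place, such a name is rejected before the process starts. Rate limiting, which several models also added, would sit in the Express layer and is omitted here.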

Cost Analysis and Recommendations

The total cost to run all three evaluations across the six models was approximately $1.81. The ImageMagick case was the most expensive because the best solutions were long and thorough. The Node.js merge case was the cheapest, at around $0.60 for the evaluation, or about $0.10 per individual model run.

Pragmatic Recommendations:

Budget-Conscious Bulk Scans: For cost-sensitive scenarios and bulk scanning, Gemini 2.5 Pro or OpenAI O3 offer 90-95% of GPT-5's quality at a 72% lower cost.
Mission-Critical Applications: For sensitive data (financial, health, administrative paths), investing in GPT-5 is recommended due to its superior layered guardrails.
General OWASP Reviews: Claude Sonnet 4.5 provides a good balance of effectiveness on familiar patterns and cost for broader OWASP-style reviews.

Human vs. AI Judgment

AI Judge: Selected GPT-5 as the best overall model based on the rubric.
Human Validation: Engineers preferred OpenAI O3 for deployment. The reasoning was that O3's fixes were simpler, more readable (reviewable in ~15 minutes), and still effectively addressed complex modern issues. GPT-5's solutions were considered "maximalist" and potentially more challenging to maintain long-term, mirroring real-world engineering trade-offs where the "most perfect" solution isn't always the most practical.

Conclusion and Key Takeaways

The evaluation demonstrated significant progress, with all models successfully identifying the tested vulnerabilities. The key differentiators were the completeness of the fixes, the implementation of layered guardrails, and the practical maintainability of the proposed solutions.

OpenAI O3: Emerged as a pragmatic choice for human engineers due to its balance of strength, readability, and cost-effectiveness at scale.
GPT-5: Is the recommended choice for critical systems where maximum security and layered defenses are paramount.
Gemini 2.5 Pro and Claude Sonnet 4.5: Serve as effective "workhorses" for routine code hygiene and day-to-day security checks.

The overarching advice is to match the AI model to the specific mission or task rather than seeking a single "best" model for all situations.