3 LLMs TESTED: Gemini 3 Pro V/S 4.5 Opus V/S GPT-5.1! Results are INSANE!

Key Concepts

AI Coding Models: GPT 5.1, Gemini 3.0, Claude Opus 4.5
Testing Methodology: Prompt adherence, code refactoring, system extension
Prompt Adherence: Literal interpretation vs. guideline interpretation of instructions.
Code Refactoring: Improving legacy code by addressing security vulnerabilities, naming conventions, validation, async patterns, transactions, and secrets management.
System Extension: Understanding existing architecture and adding new functionality while maintaining consistency.
Defensiveness: Adding safeguards and input checks beyond explicit requirements.
Completeness: Implementing all requirements and adding useful, often unrequested, features.
Precision: Adhering strictly to specified requirements with minimal deviation.
Verbosity: The amount of code generated, including comments, documentation, and explicit types.
Cost: The financial expense associated with using each model's API.
Auditability: The ease with which code can be reviewed for correctness and security.

Context and Release Timeline

In November, three major AI coding models were released:

November 12th: OpenAI released GPT 5.1 and GPT 5.1 Codex Max.
November 18th: Google released Gemini 3.0, an upgrade from its 2.5 version.
November 24th: Anthropic released Claude Opus 4.5.

The video aims to provide a practical comparison of these models for real-world coding tasks, focusing on prompt adherence, refactoring a messy TypeScript API, and extending a notification system.

Testing Methodology

The comparison was conducted by Kilo Code using three distinct tests within the same environment, starting from an empty project.

Prompt Adherence: Tested using a Python rate limiter with 10 rigid rules. Code mode was used.
Code Refactoring: Involved restructuring a legacy TypeScript API handler with numerous security flaws. Code mode was used.
System Extension: Required understanding an existing notification system and adding an email handler. Ask mode was used for analysis, followed by code mode for implementation.

This methodology simulates how developers might use these models within IDEs like VS Code or JetBrains.

Test 1: Python Rate Limiter (Prompt Adherence)

Objective: To assess how literally each model follows a prompt with 10 strict rules.

Prompt Details:

Specific class name: TokenBucketLimiter.
Method signatures: try_consume returning a tuple.
Exact error messages.
Implementation details: time.monotonic and threading.Lock.
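
The constraints listed above can be sketched roughly as follows. This is a minimal illustration, not the code any model produced: the constructor parameters, the default of one token per call, and the shape of the returned tuple (allowed, seconds to wait) are assumptions, since the summary does not give the full prompt or its exact error messages.

```python
from __future__ import annotations

import threading
import time


class TokenBucketLimiter:
    """Token bucket sketch. Parameter names and tuple shape are assumptions."""

    def __init__(self, capacity: float, refill_rate: float,
                 initial_tokens: float | None = None) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.current_tokens = capacity if initial_tokens is None else initial_tokens
        self._last_refill = time.monotonic()  # monotonic clock ignores wall-clock jumps
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self._last_refill
        self.current_tokens = min(self.capacity,
                                  self.current_tokens + elapsed * self.refill_rate)
        self._last_refill = now

    def try_consume(self, tokens: float = 1) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds). The tuple shape is hypothetical."""
        with self._lock:  # serialize refill and consume across threads
            self._refill()
            if self.current_tokens >= tokens:
                self.current_tokens -= tokens
                return True, 0.0
            deficit = tokens - self.current_tokens
            return False, deficit / self.refill_rate
```

Note that the sketch performs no input validation at all, which is the kind of literal reading the test was designed to reward.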

Results:

Gemini 3.0: Followed the specification precisely, generating simple, clean code without extra validation.
GPT 5.1: Adopted a defensive approach, adding input checks (e.g., positive tokens) and validating refill rate and initial tokens in the constructor, which were not explicitly requested.
Opus 4.5: Positioned between the two. Produced clean code with good docstrings but lost a point for naming an internal variable tokens instead of the specified current_tokens.

Takeaway: Gemini excels at literal interpretation for strict specifications. Opus offers a close second with better documentation. GPT adds safeguards that can be beneficial for production but may be inconvenient when exact minimal behavior is desired.
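
To make the contrast concrete, the unrequested checks attributed to GPT 5.1 might look something like the sketch below. The function names and error messages are hypothetical; the summary only says that token counts, refill rate, and initial tokens were validated.

```python
def validate_limiter_args(capacity: float, refill_rate: float,
                          initial_tokens: float) -> None:
    """Constructor-time checks that a strict spec may not have asked for."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if refill_rate <= 0:
        raise ValueError("refill_rate must be positive")
    if not 0 <= initial_tokens <= capacity:
        raise ValueError("initial_tokens must be between 0 and capacity")


def validate_consume_amount(tokens: float) -> None:
    """Per-call check that the requested token count is positive."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
```

Checks like these are useful in production, but they change behavior: a caller that previously passed zero tokens now gets an exception, which is why the test scored them as deviations.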

Test 2: TypeScript API Handler Refactor

Objective: To evaluate the models' ability to refactor a complex, insecure legacy TypeScript API handler.

Legacy Handler Issues:

365 lines of code.
Over 20 SQL injection vulnerabilities.
Inconsistent naming conventions (e.g., username vs. user_id).
Lack of validation.
Excessive use of any types.
Mixed asynchronous patterns.
Absence of database transactions.
Plain text secrets.

Tasks: Add Zod validation, fix security issues, and clean up the structure.
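
Two of the fixes involved, parameterized queries and an ownership check, can be illustrated with a short sketch. It is written in Python with the standard sqlite3 module rather than TypeScript, and the tasks table and its columns are assumptions made up for the example; the original handler's schema is not shown in the summary.

```python
import sqlite3


def get_task_for_user(conn: sqlite3.Connection, task_id: int, user_id: int):
    """Return (id, title) only when the task belongs to user_id, else None."""
    # The ? placeholders send values separately from the SQL text, so input
    # such as "1 OR 1=1" is treated as data rather than as part of the query.
    return conn.execute(
        "SELECT id, title FROM tasks WHERE id = ? AND user_id = ?",
        (task_id, user_id),
    ).fetchone()


# Hypothetical in-memory data for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.execute("INSERT INTO tasks VALUES (1, 42, 'write report')")

print(get_task_for_user(conn, 1, 42))  # the owner can read the task
print(get_task_for_user(conn, 1, 7))   # a different user gets None
```

Filtering on user_id inside the query itself is one way to enforce ownership; checking the owner after fetching the row would work too, at the cost of loading data the caller may not be allowed to see.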

Results:

Opus 4.5: Achieved a perfect score (100/100). It was the only model to implement rate limiting, which was an explicit requirement.
GPT 5.1: Covered 9 out of 10 requirements and found and fixed critical security leaks, such as verifying that a user owns a task before returning it. It also implemented database transactions for multi-step operations and accepted both old and new field names for backward compatibility. Like Gemini, it skipped rate limiting and hard-coded secrets.
Gemini 3.0: Completed 8 out of 10 requirements. It generated cleaner, faster code but missed architectural-level improvements like database transactions (noting them with a comment instead of implementing) and backward compatibility for field names. It hard-coded secrets.

Key Differences:

Authorization Checks: GPT 5.1 was defensive, fixing a potential leak. Gemini missed this.
Database Transactions: GPT 5.1 implemented them; Gemini commented on them.
Backward Compatibility: GPT 5.1 supported old and new field names; Gemini only supported new ones.
Rate Limiting: Opus 4.5 implemented it; GPT and Gemini ignored it.
Environment Variables: Opus 4.5 used them for secrets; GPT and Gemini hard-coded them.
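
Three of the differences listed above (environment-based secrets, backward-compatible field names, and transactions) can be sketched as follows. The example is in Python with sqlite3 rather than TypeScript, and every name in it is hypothetical: the JWT_SECRET variable, the legacy userId field, and the tasks and audit_log tables are assumptions for illustration, not details from the tested handler.

```python
from __future__ import annotations

import os
import sqlite3


def load_secret(name: str = "JWT_SECRET") -> str:
    """Read a secret from the environment instead of hard-coding it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def normalize_task_payload(payload: dict) -> dict:
    """Accept a legacy field name ("userId") alongside the new one ("user_id")."""
    user_id = payload.get("user_id", payload.get("userId"))
    if user_id is None:
        raise ValueError("user_id is required")
    return {"user_id": int(user_id), "title": str(payload.get("title", "")).strip()}


def reassign_task(conn: sqlite3.Connection, task_id: int,
                  from_user: int, to_user: int) -> None:
    """Reassign a task and record it in an audit log as one atomic unit."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        cur = conn.execute(
            "UPDATE tasks SET user_id = ? WHERE id = ? AND user_id = ?",
            (to_user, task_id, from_user),
        )
        if cur.rowcount != 1:
            raise LookupError("task not found for this owner")
        conn.execute(
            "INSERT INTO audit_log (task_id, actor_id) VALUES (?, ?)",
            (task_id, from_user),
        )
```

If the audit insert fails, the ownership change is rolled back with it, which is the property a comment such as "TODO: use a transaction" does not provide.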

Takeaway: Opus implemented every requirement end to end, which is highly valuable for teams under pressure. GPT provided thorough, defensive code with excellent security awareness and architectural understanding. Gemini was faster but missed crucial architectural and security aspects.

Test 3: Notification System Understanding and Extension

Objective: To assess the models' ability to understand an existing system and extend it with new functionality.

Existing System: A 400-line system supporting webhook and SMS notifications.

Tasks:

Explain the architecture (Ask mode).
Add an email handler mirroring the existing pattern (Code mode).

Results:

Opus 4.5: Was the fastest (approx. 1 minute) and produced the most complete implementation (936 lines), including templates for all seven notification events and runtime template management. It balanced diagrammatic explanation with concrete code suggestions.
GPT 5.1: Produced a detailed 306-line audit with a Mermaid sequence diagram and specific line references, and identified hidden bugs. Its implementation was full-featured, mirroring the architecture well with support for multiple recipients and attachments. It generated 1.5 to 1.8 times more lines than Gemini due to JSDoc, error handling, and explicit types.
Gemini 3.0: Provided a concise 51-line summary, identifying high-level patterns and missing components but not digging deep into bugs. Its implementation was basic, focusing on sending emails and omitting attachments and recipient arrays. It reasoned longer before emitting code, making it more expensive than GPT 5.1 for this test despite shorter output.

Understanding Phase:

GPT 5.1: Detailed audit, Mermaid diagram, specific bug calls.
Gemini 3.0: Concise summary, identified patterns, but lacked depth on bugs.
Opus 4.5: Balanced diagrams with code suggestions, e.g., adding an abstract channel getter.

Implementation Phase:

Opus 4.5: Most thorough, including templates for all events, runtime management, and display name support.
GPT 5.1: Full-featured, mirroring architecture, handling attachments and multiple recipients.
Gemini 3.0: Basic email sending, skipped advanced features, assumed recipient email presence.

Takeaway: Opus provided the most comprehensive output. GPT demonstrated strong architectural understanding and a feature-rich implementation. Gemini was efficient but less detailed. All models noted a design flaw but maintained existing patterns for consistency.
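
The "mirror the existing pattern" task can be pictured with a small sketch of a handler hierarchy. The tested system's real interfaces are not shown in the summary, so the Notification fields, the NotificationHandler base class, and the in-memory outbox below are all assumptions; the abstract channel property follows the suggestion attributed to Opus 4.5.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Notification:
    event: str
    recipient: str
    body: str


class NotificationHandler(ABC):
    """Base class that webhook, SMS, and email handlers would all share."""

    @property
    @abstractmethod
    def channel(self) -> str:
        """Name of the delivery channel, e.g. "sms" or "email"."""

    @abstractmethod
    def send(self, notification: Notification) -> bool:
        """Deliver the notification; return True on success."""


class EmailHandler(NotificationHandler):
    def __init__(self, templates: dict[str, str]) -> None:
        self._templates = templates                 # one template per event name
        self.outbox: list[tuple[str, str]] = []     # stand-in for a real mail client

    @property
    def channel(self) -> str:
        return "email"

    def send(self, notification: Notification) -> bool:
        if not notification.recipient:              # do not assume an address exists
            return False
        template = self._templates.get(notification.event, "{body}")
        self.outbox.append((notification.recipient,
                            template.format(body=notification.body)))
        return True


handler = EmailHandler({"order_shipped": "Your order shipped: {body}"})
handler.send(Notification("order_shipped", "user@example.com", "tracking 123"))
```

The recipient check marks the difference between the basic and the more defensive implementations described above: a handler that assumes an address is always present would skip it.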

Performance Summary

Speed: Opus 4.5 was the fastest overall (approx. 7 minutes total).
Output Volume: GPT 5.1 generated significantly more lines of code than Gemini because it included JSDoc, error handling, and explicit types.
Cost: Gemini 3.0 was the cheapest overall ($1.68). Opus was the most expensive ($110) but scored highest, so the extra cost may be justified when a complete first-try implementation matters.
Score: Opus 4.5 achieved the highest average score.

Code Style Comparison

GPT 5.1: Verbose, with extensive JSDoc, error handling, and explicit types.
Gemini 3.0: Minimalist, shortest working implementation, fewer comments, looser types (e.g., any).
Opus 4.5: Organized, strict types, clear section headers, custom error classes (e.g., DatabaseError), and generic type parameters. It falls between GPT and Gemini in verbosity, prioritizing organization and completeness.
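
As a rough picture of the organized style described for Opus 4.5, the sketch below shows a custom error class and a generic result type. It is a Python analogue of what would be TypeScript in the tests; only the DatabaseError name comes from the summary, and the Result type and its fields are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class DatabaseError(Exception):
    """Raised when a query fails; keeps the underlying cause for debugging."""

    def __init__(self, message: str, cause: Optional[Exception] = None) -> None:
        super().__init__(message)
        self.cause = cause


@dataclass
class Result(Generic[T]):
    """Holds either a value of type T or a DatabaseError (hypothetical helper)."""

    value: Optional[T] = None
    error: Optional[DatabaseError] = None

    @property
    def ok(self) -> bool:
        return self.error is None
```

A minimalist version would return untyped values and raise a bare Exception, which is shorter but gives reviewers less to check against.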

Prompt Adherence vs. Helpfulness

Test 1 (Rate Limiter):
- Gemini: Highest score for literal adherence.
- Opus: Second, with clean code and good docs.
- GPT: Lower score for adding unrequested features.
Tests 2 & 3 (Refactoring & Extension):
- Opus: Highest score for implementing everything and adding useful extras.
- GPT: Second, with defensive, well-documented code.
- Gemini: Third, for minimal interpretation that worked but missed deeper issues.

Conclusion: The choice depends on desired output: precision (Gemini), defensiveness (GPT), or completeness (Opus). Opus and GPT provided confidence in correctness and auditability.

Practical Tips

Opus 4.5: Be aware of extra features (such as runtime template management) and organizational overhead. Great for large projects, potentially overkill for small scripts. Because it reads secrets from environment variables, those variables must be configured before the code runs.
GPT 5.1: Watch for overengineering, contract changes, and unrequested validation that may reject inputs the original code accepted.
Gemini 3.0: Look for missing safeguards, edge case handling, and documentation. Verify all requirements were met.

Prompting Strategy

Opus 4.5: To get minimal code, explicitly state it. Otherwise, expect full implementations with error handling, environment variables, and organized sections.
GPT 5.1: To get minimal code, explicitly state "don't add extra validation" and "keep it minimal."
Gemini 3.0: To get production-ready code, ask for extras like JSDoc, edge case handling, validation, and explicit implementation of every requirement.

Stating these preferences explicitly in the prompt gives control over output verbosity and completeness.
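
One way to apply these prompting tips consistently is to append a fixed set of constraints to each task prompt. The helper below is a hypothetical sketch: the build_prompt function, the style names, and the "Constraints:" layout are assumptions, while the directive wording follows the tips above.

```python
# Directive wording follows the tips above; the grouping into styles is an assumption.
STYLE_DIRECTIVES = {
    "minimal": [
        "Keep it minimal.",
        "Don't add extra validation.",
    ],
    "production": [
        "Add JSDoc comments.",
        "Handle edge cases.",
        "Validate inputs.",
        "Implement every requirement explicitly.",
    ],
}


def build_prompt(task: str, style: str) -> str:
    """Append the directives for the chosen style to a task description."""
    directives = STYLE_DIRECTIVES[style]
    bullet_list = "\n".join(f"- {d}" for d in directives)
    return f"{task.rstrip()}\n\nConstraints:\n{bullet_list}"


print(build_prompt("Refactor the legacy API handler.", "minimal"))
```

The "minimal" style would suit models that tend to add extras, and "production" suits a model that reads prompts literally.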

Verdict

All three models can handle complex coding tasks.

Claude Opus 4.5: Comprehensive, organized, production-ready, fastest, highest average score, implements all requirements, and adds smart features automatically.
GPT 5.1: Thorough, defensive, well-documented, strong architectural understanding with diagrams, built-in safeguards, and backward compatibility.
Gemini 3.0: Exact, efficient, minimal, cheapest, follows specs literally, and avoids unrequested features.

The choice should be based on specific needs: completeness (Opus), defensiveness (GPT), or precision (Gemini). These trade-offs are the main basis for matching a model to a given workload.
