Gemini-3.0 Pro Agentic Tests (& New KingEval): I TESTED Gemini-3 on AGENTIC TESTS & NEW BENCHMARK!

By AICodeKing

Key Concepts

  • Gemini 3 Pro: A new AI model tested for coding capabilities.
  • Kingbench: A benchmark for evaluating AI coding performance, with a 2.0 version and a planned harder 13-question set.
  • GDScript Bench: A benchmark specifically for evaluating AI performance in generating GDScript code for the Godot game engine.
  • Svelte Bench: A benchmark designed to test AI capabilities in generating Svelte code.
  • Intelligence Index: A composite score derived from averaging benchmark results to represent overall coding intelligence.
  • Price to Performance Chart: A chart that compares the cost of running the benchmarks against the performance achieved.
  • Agentic Benchmarks: Tests that evaluate AI's ability to perform multi-step tasks and interact with tools.
  • Kilo Code: An AI coding assistant used for agentic benchmark testing.
  • Augment Code: An enterprise-grade AI assistant for engineering teams, highlighted as a sponsor.

Gemini 3 Pro Performance and Benchmarking

The video details the performance of Gemini 3 Pro on coding-related tasks, comparing it against other models such as Sonnet, Opus, and GPT-5.1 Codex.

Kingbench 2.0 and New Benchmarks

  • Kingbench 2.0: Gemini 3 Pro scored a perfect 100% on this benchmark. The video describes this as roughly 50% better than Sonnet and about twice as good as GPT-5.1 Codex in planning tasks.
  • New Benchmarks: To further assess AI capabilities, two new benchmarks have been developed:
    • GDScript Bench: This benchmark focuses on GDScript, the language used by the Godot game engine. It comprises 60 questions and is evaluated using unit tests for error checking and an LLM judge for code quality assessment. Godot is described as an open-source game engine.
    • Svelte Bench: This benchmark evaluates AI's ability to generate Svelte code. Like the GDScript Bench, it is scored with unit tests and an LLM judge.
  • Intelligence Index: The scores from these benchmarks are averaged to create an "intelligence score," primarily focused on coding.
  • Cost-Effectiveness: The cost of running these benchmarks is also considered, with Gemini 3 Pro being noted as cheaper than Sonnet.
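
The scoring approach described above (unit tests for correctness, an LLM judge for code quality, then an average across benchmarks) could be sketched roughly as follows. This is a minimal illustration and not the creator's actual harness: the 50/50 weighting, the function names, and the sample inputs are all assumptions, since the summary does not say how the two signals are combined or whether the benchmarks are weighted equally.

```python
from statistics import mean


def question_score(tests_passed: int, tests_total: int, judge_score: float,
                   test_weight: float = 0.5) -> float:
    """Blend a unit-test pass rate with an LLM-judge score, both on a 0..1 scale.

    The default 50/50 weighting is a hypothetical choice for illustration.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    pass_rate = tests_passed / tests_total
    return test_weight * pass_rate + (1.0 - test_weight) * judge_score


def benchmark_score(question_scores: list[float]) -> float:
    """Benchmark score as a percentage: the mean of the per-question scores."""
    return 100.0 * mean(question_scores)


def intelligence_index(benchmark_scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark percentages (assumed, not confirmed)."""
    return mean(list(benchmark_scores.values()))


if __name__ == "__main__":
    # Hypothetical inputs: two questions on one made-up benchmark.
    demo = benchmark_score([question_score(4, 4, 0.8), question_score(2, 4, 0.6)])
    print(f"demo benchmark score: {demo:.1f}")
    print(f"demo index: {intelligence_index({'bench_a': demo, 'bench_b': 50.0}):.1f}")
```

Keeping the per-question blend separate from the averaging step makes it easy to swap in a different weighting without touching the index calculation.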

Benchmark Scores and Comparisons

The video presents specific scores for the developed benchmarks:

  • Average Intelligence Index:
    • Gemini 3 Pro: 60.4
    • Sonnet: 37.5
    • Opus: 34.9
    • GPT-5.1 Codex High: 31.3
  • GDScript Bench Scores:
    • Gemini 3 Pro: 20.88
    • Followed by Opus, Sonnet, and others.
  • Svelte Bench Scores:
    • Gemini 3 Pro: 83.3
    • Followed by GPT-5.1 Codex and GPT-5 mini.
  • Reasoning Effort: All benchmarks were conducted using high reasoning effort via the Gemini 3 Pro API.
  • Price to Performance: Gemini 3 Pro ran all benchmarks for $2.85, significantly cheaper than Sonnet.
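
To make the price-to-performance comparison concrete, the sketch below divides an index score by the benchmark run cost and expresses one model's index relative to another's. It uses only figures quoted in this summary (index values of 60.4, 37.5, and 31.3, and the $2.85 run cost for Gemini 3 Pro). "Points per dollar" is an assumed metric, because the summary does not define how the chart is computed, and run costs for the other models are not given here, so they are left out.

```python
def points_per_dollar(index: float, cost_usd: float) -> float:
    """Index points obtained per US dollar of benchmark spend (assumed metric)."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return index / cost_usd


def relative_gain(a: float, b: float) -> float:
    """Percentage by which score a exceeds score b."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return 100.0 * (a - b) / b


# Figures as quoted in the summary.
GEMINI_INDEX, SONNET_INDEX, CODEX_HIGH_INDEX = 60.4, 37.5, 31.3
GEMINI_RUN_COST_USD = 2.85

print(round(points_per_dollar(GEMINI_INDEX, GEMINI_RUN_COST_USD), 1))  # 21.2
print(round(relative_gain(GEMINI_INDEX, SONNET_INDEX), 1))             # 61.1
print(round(relative_gain(GEMINI_INDEX, CODEX_HIGH_INDEX), 1))         # 93.0
```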

Future Benchmark Development

The creator plans to develop a new Kingbench setup with harder questions that are more "video friendly" to further challenge models. Viewers are encouraged to submit questions that models struggle with.

Agentic Benchmark Testing with Kilo Code

The video then shifts to testing Gemini 3 Pro's agentic capabilities using Kilo Code, the creator's preferred AI coding assistant. Gemini 3 is noted to be behind a waitlist in the Gemini CLI but available via the API and in the Antigravity editor.

Specific Agentic Tasks and Results

  • Movie Tracker App: Gemini 3 Pro generated a functional movie tracker app with a good homepage and inner pages, considered one of the finest for one-shot generation.
  • Godot Game (FPS): The model successfully implemented a step counter and a health bar affected by jumping in a basic FPS game, with customizable settings.
  • Go TUI Calculator: A functional calculator was generated, with good calculation accuracy and navigation.
  • Svelte App: While fully functional, the UI was not as impressive as what Sonnet could produce.
  • Undefeated OpenCode Question: Gemini 3 Pro now passes this challenging question, which previously only CodeBuff (a combination of models at a high cost) had passed. The generated SVG command follows the UI aesthetics and works as intended, even outperforming CodeBuff's output.
  • Nuxt App: This task failed, with the app not opening and numerous errors. Notably, Sonnet, GPT-5, and CodeBuff also failed this task.
  • Tauri App: A successful generation that allows opening folders, listing images, cropping, and annotating.

Agentic Leaderboard and Overall Assessment

  • Leaderboard Position: Gemini 3 Pro took the top position on the agentic leaderboard with a score of 71.4%, breaking the 70% threshold for the first time and surpassing CodeBuff.
  • Hallucinations: While Gemini 3 Pro can sometimes hallucinate in longer agentic tasks, it recovers well.
  • Potential: The creator believes that with better support from agentic harnesses and tuned system prompts, Gemini 3 Pro could perform even better.
  • Daily Driving: The creator is currently using Gemini 3 Pro daily and finds it superior to Sonnet for various tasks.

Sponsor: Augment Code

The video includes a sponsorship segment for Augment Code, described as an enterprise-grade AI assistant for engineering teams.

  • Key Features:
    • Proprietary context engine that retrieves relevant snippets in milliseconds across large codebases (100k-file monorepos).
    • Feeds entire repos into the best available model in real-time.
    • Offers smart, in-context suggestions for production code.
    • Works with Claude Sonnet 4+ and delivers high quality at the same price.
    • Automatic model upgrades without manual selection.
    • Seamless integration with VS Code, JetBrains, Vim, and Cursor.
    • Secure by default, never trains on user code, and supports customer-managed encryption keys.
    • Pay-per-request pricing, no seat licenses or complex token math.
    • New features like remote agents for launching, monitoring, and merging pull requests from cloud workers.
  • Call to Action: A free 14-day trial is available at augmentcode.com.

Conclusion and Future Content

The creator thanks the audience for supporting the benchmarks and tests, and announces a future video focused on testing the Antigravity editor. Viewers are encouraged to share their thoughts, subscribe, and consider supporting the channel through donations or memberships.