Was I Wrong? GPT 5.3 vs Opus 4.6 — Round 2

By Eduards Ruzga

GPT 5.3 vs. Opus 4.6: A Detailed Comparison of GitHub Stargazer Scraping & Connection Finding - Round Two

Key Concepts:

  • GPT 5.3 Codex: A large language model (LLM) focused on code generation and understanding.
  • Opus 4.6: Another LLM, positioned as more capable and “agentic” in task completion.
  • Agentic Browser Use: Using an LLM to control a web browser for automated tasks like scraping and data extraction.
  • Micro-management: The degree of human intervention required to guide an LLM through a task.
  • Agentic Behavior: An LLM’s ability to independently plan, execute, and adapt to achieve a goal without constant human direction.
  • Rate Limiting: Restrictions imposed by websites (like X/Twitter) to prevent abuse by automated scripts.
  • Nudge: A small intervention or prompt given to an LLM to steer it towards a desired outcome.
  • Tokens: Units of text used by LLMs for processing; cost is often calculated per token.

Initial Comparison & The Original Mistake

The video revisits a previous comparison between GPT 5.3 Codex and Opus 4.6, focusing on their ability to scrape GitHub stargazers, find those users’ X (formerly Twitter) and LinkedIn profiles, and identify potential connections. The initial assessment favored GPT 5.3, despite it not fully completing the task, because of its significantly lower cost ($0.29 vs. $20 for Opus 4.6) and the assumption that with more time and “nudges” it could match Opus’s completion rate. Commenters challenged this conclusion, pointing out GPT 5.3’s incomplete task execution.

The original task involved scraping approximately 5,500 GitHub stargazers. Opus 4.6 successfully identified 674 valid links to connect with, at a rate of 38 links per minute. GPT 5.3 scraped 1,000 stargazers but found only 48 links (4 links per minute), relying on a more “brute force” approach. GPT 5.3, however, had zero errors in link validity, while Opus 4.6 had a 7% error rate.
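In the video the models drove a browser to collect this list; for context, the same stargazer logins can also be pulled through GitHub’s public, paginated REST API. A minimal sketch (the repository name is a placeholder; the video does not show the models’ actual code):

```python
import json
import urllib.request

API = "https://api.github.com/repos/{owner}/{repo}/stargazers"

def pages_needed(total_stars, per_page=100):
    """GitHub caps this endpoint at 100 entries per page;
    ceil-divide to estimate how many requests a full run takes."""
    return -(-total_stars // per_page)

def fetch_stargazers(owner, repo, per_page=100, max_pages=60):
    """Page through the public stargazers endpoint until an empty page.
    Unauthenticated requests work but have a low rate limit."""
    logins = []
    for page in range(1, max_pages + 1):
        url = API.format(owner=owner, repo=repo) + f"?per_page={per_page}&page={page}"
        req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            batch = json.load(resp)
        if not batch:  # empty page means we've seen every stargazer
            break
        logins.extend(user["login"] for user in batch)
    return logins
```

At ~5,500 stargazers, `pages_needed(5500)` gives 55 requests, which is why pacing and rate limits matter for the rest of the task.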

Round Two: GPT 5.3’s Full Run & Results

To address the criticism, a second run was conducted, allowing GPT 5.3 to continue until task completion. This run required eight interactions (or “nudges”) from the user. The results were as follows:

  • Completion: GPT 5.3 ultimately found 645 links.
  • Time: The full run took 41 minutes, significantly less than the initially estimated 3 hours.
  • Links per Minute: The rate increased to roughly 15 links per minute, still slower than Opus 4.6’s 38 links per minute.
  • Errors: GPT 5.3 maintained its zero-error rate.
  • Cost: The final cost was $1.58, over 10 times cheaper than Opus 4.6.
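Working through the reported numbers makes the trade-off concrete. (Opus 4.6’s runtime here is back-derived from its 38-links-per-minute rate, since the video reports the rate rather than a duration.)

```python
# Per-link economics derived from the figures reported in the video.
runs = {
    "Opus 4.6": {"links": 674, "minutes": 674 / 38, "cost_usd": 20.00},
    "GPT 5.3":  {"links": 645, "minutes": 41,       "cost_usd": 1.58},
}

def per_link_cost(run):
    """Dollars spent per valid link found."""
    return run["cost_usd"] / run["links"]

def links_per_minute(run):
    """Throughput: valid links found per minute of runtime."""
    return run["links"] / run["minutes"]
```

By these figures GPT 5.3 comes out at about $0.0024 per link versus roughly $0.0297 for Opus 4.6, which is where the “over 10 times cheaper” claim comes from.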

Key Differences in Approach & Performance

The second run highlighted crucial differences in how the two models approach problem-solving:

  • Micro-management: Opus 4.6 demonstrated stronger “agentic” behavior, requiring less human intervention. It was described as a “senior engineer” who could be given a goal and independently pursue it. GPT 5.3, likened to a “junior engineer,” needed more specific instructions and frequent guidance. As stated, “You need to be very specific in details that could be derived by a smart person but not by models… You need to be very specific with what you want to achieve and why with GPT.”
  • Efficiency vs. Brute Force: Opus 4.6 worked “smarter, not harder,” efficiently identifying connections, while GPT 5.3 initially employed a more laborious “brute force” method.
  • Adaptability & Problem Solving: GPT 5.3 ran into X/Twitter’s rate limiting and required human intervention to adjust its strategy (“Strategy was too aggressive for X”). Opus 4.6 appeared better equipped to handle such challenges independently, coming up with more efficient ideas on its own.
  • Code Execution: During the validation phase, GPT 5.3 switched to executing code directly, rather than checking each link through the LLM itself, demonstrating a degree of resourcefulness.
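That validation step amounts to issuing lightweight HTTP requests instead of spending model tokens per link. A minimal sketch of the idea (not the model’s actual script, which the video doesn’t show):

```python
import urllib.error
import urllib.request

def check_link(url, timeout=10):
    """Return (url, status) without downloading the body: a HEAD request
    is enough to separate live profiles from dead ones, and a 429
    status signals that the site is rate-limiting us."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except urllib.error.HTTPError as err:
        return url, err.code  # 404, 429, etc. still carry a status code

def is_valid(status):
    """Count a profile link as valid only on a 2xx response."""
    return 200 <= status < 300
```

Batch-checking links this way is both faster and cheaper than asking the model to reason about each one, which is presumably why the approach paid off.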

Cost Analysis & Strategic Implications

The significant cost difference remained a central argument. The speaker questioned the value of paying 10x more for Opus 4.6, especially for a task of “medium complexity.” He proposed a hybrid approach:

  • Opus for Research & Planning: Use Opus 4.6 for initial research, specification creation, and high-level planning.
  • GPT 5.3 for Implementation: Leverage GPT 5.3 as a “working horse” to execute the plan, benefiting from its lower cost. “I would use Opus to research and create specifications and plans in especially important areas. But I still would make GPT 5.3 the working horse, in the sense that I would take research and plans from Opus and give them to a team of GPTs to implement.”

Technical Challenges & Bugs

The run revealed a few technical issues:

  • X/Twitter Rate Limiting: GPT 5.3 triggered rate limits on X/Twitter due to aggressive profile scanning.
  • Chrome Connector Bug: A bug was identified in the Chrome connector, causing it to continue running even after tasks were completed.
  • LinkedIn Data Discrepancies: The number of LinkedIn connections reported by GPT 5.3 did not match the actual number on the user’s profile.
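The rate-limiting failure is the textbook case for exponential backoff with jitter: instead of scanning profiles at a fixed aggressive pace, the scraper doubles its delay after each refusal. A minimal sketch (parameter values are illustrative, not from the video):

```python
import random

def backoff_delays(attempts, base=2.0, cap=120.0, seed=None):
    """Exponential backoff with jitter: one standard way to slow a
    scraper down once a site such as X starts returning 429s.
    Returns the wait (in seconds) before each retry attempt."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))   # 2, 4, 8, ... capped
        delays.append(delay * rng.uniform(0.5, 1.0))  # jitter desynchronizes bursts
    return delays
```

The jitter matters as much as the doubling: identical retry schedules from parallel workers would otherwise hit the site in synchronized bursts and trip the limit again.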

Conclusion & Key Takeaways

The video concludes that while Opus 4.6 is more capable and requires less micro-management, GPT 5.3 offers a compelling value proposition due to its significantly lower cost. The ideal strategy leverages the strengths of both models: Opus for high-level research and planning, GPT 5.3 for low-cost implementation. The speaker acknowledges a near draw in overall performance, but emphasizes that for tasks of this complexity, GPT 5.3 is the more economically sensible choice. The experience reinforces the need for careful prompt engineering and a degree of human oversight, even with advanced LLMs.
