$300 Billion Evaporated from SaaS because of AI. But Is It Justified? Let's test it!

Key Concepts

GDPval: A benchmark evaluating AI models on real economic tasks across 44 occupations.
OSWorld Benchmark: A test measuring agentic use of computers, comparing AI performance to humans.
Desktop Commander: An application developed by the speaker’s team, providing AI control over desktop environments (Windows & Mac).
Agentic Use: The ability of an AI to autonomously perform tasks, particularly those requiring computer interaction.
Stargazer Outreach: The process of identifying and connecting with individuals who have starred a GitHub repository.
Skill (within Desktop Commander): Reusable modules for performing specific tasks, potentially created from successful AI agent workflows.

The SaaS Apocalypse & The Rise of AI Agents

The video addresses the recent sharp drop in the stock market value of SaaS companies (approximately $300 billion, termed the “SaaS apocalypse”) and asks whether it is justified or simply market hype. The core argument is that the release of advanced AI models, specifically OpenAI’s GPT-5.3 Codex and Anthropic’s Opus 4.6, is driving this concern, because these models show increasingly sophisticated capabilities in work traditionally done by human knowledge workers.

AI Model Performance & Benchmarks

The speaker highlights key benchmark results demonstrating the rapid improvement of these AI models:

GDPval: Opus 4.6 is 14% better than Opus 4.5 and 34% better than Gemini 3 Pro on this benchmark. The speaker reads this as an improvement rate of approximately 7% per month in economic task performance, a rate described as unprecedented in human progress.
OSWorld Benchmark: Opus 4.6 slightly exceeds human performance (72.7% vs. 72% for humans) in agentic computer use. GPT-5.3 Codex achieves 64.7% on this benchmark, still below human performance but rapidly closing the gap.
OpenAI’s Claims: OpenAI highlights GPT-5.3 Codex’s performance on the OSWorld benchmark, emphasizing its near-human capabilities in agentic tasks.
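The per-month figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 14% gain accrued over roughly two months, an interval the summary does not state, and the function name monthly_rate is hypothetical.

```python
# Assumption: the 14% benchmark gain accrued over about two months.
# The source does not state the interval; change `months` to test other values.
def monthly_rate(total_gain: float, months: float, compounded: bool = False) -> float:
    """Return the per-month improvement implied by a total fractional gain."""
    if compounded:
        # Solve (1 + r) ** months == 1 + total_gain for r.
        return (1.0 + total_gain) ** (1.0 / months) - 1.0
    # Simple (linear) division of the gain across the months.
    return total_gain / months


simple = monthly_rate(0.14, 2)
compound = monthly_rate(0.14, 2, compounded=True)
print(f"simple: {simple:.1%}, compounded: {compound:.1%}")
```

Under that two-month assumption, the simple rate matches the quoted 7% per month, and the compounded rate comes out slightly lower.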

Practical Testing: GPT-5.3 Codex vs. Opus 4.6

To move beyond benchmarks, the speaker runs a practical test using their application, Desktop Commander, to compare GPT-5.3 Codex and Opus 4.6 on automating a real-world task: identifying and connecting with GitHub stargazers (individuals who have “starred” an open-source project).

Methodology:

Task Definition: Automate the process of finding GitHub stargazers, checking for their presence on X (Twitter) and LinkedIn, and adding those not already connected to a list for follow-up.
Tool: Desktop Commander, which allows AI models to control desktop environments and applications.
Prompting: A detailed prompt was provided to both models outlining the task, desired outputs (CSV file of actionable profiles), and performance metrics (total profiles gathered, speed, token usage, human intervention).
Parallel Execution: Both models were run simultaneously in isolated environments to avoid interference.
Reflection & Optimization: The prompt encouraged the models to reflect on their performance and suggest optimizations for subsequent cycles.
Metrics: Total relevant profiles gathered, execution speed, token usage, and the amount of human intervention required were tracked.
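The filtering and CSV step in the task definition can be sketched as follows. This is a minimal illustration, not the prompt or code used in the video: the Profile fields, the actionable rule (at least one social link, not already connected), and the CSV column names are all assumptions.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Profile:
    login: str          # GitHub username of the stargazer
    x_url: str          # empty string if no X (Twitter) profile was found
    linkedin_url: str   # empty string if no LinkedIn profile was found


def actionable(profiles, already_connected):
    """Keep profiles with at least one social link that are not yet connected."""
    return [
        p for p in profiles
        if (p.x_url or p.linkedin_url) and p.login not in already_connected
    ]


def to_csv(profiles) -> str:
    """Serialize profiles to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["login", "x_url", "linkedin_url"])
    for p in profiles:
        writer.writerow([p.login, p.x_url, p.linkedin_url])
    return buf.getvalue()


# Hypothetical sample data for illustration only.
sample = [
    Profile("alice", "https://x.com/alice", ""),
    Profile("bob", "", ""),
    Profile("carol", "", "https://www.linkedin.com/in/carol"),
]
rows = actionable(sample, already_connected={"carol"})
print(to_csv(rows))
```

Here bob is dropped for having no social links and carol for being already connected, so only alice lands in the follow-up file.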

Results of the Practical Test

The test revealed significant differences in the approaches and performance of the two models:

GPT-5.3 Codex: Initially relied on a pre-existing “stargazer skill” (built on the GraphQL API for GitHub data). Took a more conservative approach that required less human intervention. Achieved 48 actionable links in 12 minutes (4 links/minute).
Opus 4.6: Initially attempted to brute-force the task through browser automation, then switched to the stargazer skill. Took a more aggressive approach that produced far more output, extracting 674 links in 17 minutes (reported as 38 links/minute).
Cost: GPT-5.3 Codex cost $0.29 to run, while Opus 4.6 cost $20, a substantial difference attributed to Opus’s more computationally intensive approach.
Errors: GPT-5.3 Codex had zero errors, while Opus 4.6 had a 7% error rate due to blocked accounts and invalid links.
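The reported figures can be put on a common footing by computing throughput and cost per valid link. The sketch below uses only the numbers listed above; the RunResult class and the notion of a "valid link" (total links minus the stated error share) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    links: int
    minutes: float
    cost_usd: float
    error_rate: float  # fraction of links that were invalid or blocked

    @property
    def links_per_minute(self) -> float:
        return self.links / self.minutes

    @property
    def valid_links(self) -> float:
        # Assumption: errors reduce the usable link count proportionally.
        return self.links * (1.0 - self.error_rate)

    @property
    def cost_per_valid_link(self) -> float:
        return self.cost_usd / self.valid_links


runs = [
    RunResult("Codex", links=48, minutes=12, cost_usd=0.29, error_rate=0.00),
    RunResult("Opus", links=674, minutes=17, cost_usd=20.0, error_rate=0.07),
]
for r in runs:
    print(f"{r.model}: {r.links_per_minute:.1f} links/min, "
          f"${r.cost_per_valid_link:.4f} per valid link")
```

On these inputs, 674 links in 17 minutes works out to roughly 39.6 links/minute, slightly above the 38 quoted in the video. Opus costs roughly five times more per valid link than Codex while delivering about thirteen times as many valid links.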

Conclusion of the Test: GPT-5.3 Codex was deemed the “winner” due to its lower cost, higher accuracy, and slightly lower need for human intervention, despite being initially less “smart” in its approach.

Desktop Commander & The Future of AI-Assisted Workflows

The speaker emphasizes the potential of tools like Desktop Commander to leverage AI agents for automating complex tasks. The vision is to use powerful models like Opus to create reusable “skills” that can then be executed by more cost-effective models, creating a tiered system of AI assistance.
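The tiered idea can be sketched as a small registry that routes a task to a cheaper model once a skill for it exists. This is a hypothetical illustration, not Desktop Commander's actual design: the Skill and SkillRegistry classes and the "strong-model" and "cheap-model" labels are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    steps: list[str]  # ordered instructions captured from a successful run


@dataclass
class SkillRegistry:
    skills: dict[str, Skill] = field(default_factory=dict)

    def save(self, skill: Skill) -> None:
        """Store a skill, typically after a strong model solved the task once."""
        self.skills[skill.name] = skill

    def route(self, task: str) -> tuple[str, list[str]]:
        """Pick a model tier: cheap if a skill exists, strong otherwise."""
        skill = self.skills.get(task)
        if skill is not None:
            return ("cheap-model", skill.steps)
        return ("strong-model", [])


registry = SkillRegistry()
first = registry.route("stargazer-outreach")   # no skill yet: strong tier explores
registry.save(Skill("stargazer-outreach",
                    ["query stargazers", "find social links", "write CSV"]))
second = registry.route("stargazer-outreach")  # skill exists: cheap tier replays it
print(first, second)
```

The design choice here is that the expensive model pays the exploration cost once, and later runs reuse the recorded steps.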

Is the SaaS Apocalypse Real?

The speaker concludes that the “SaaS apocalypse” is likely overhyped at this stage. While the advances in AI are significant, the models still require substantial human oversight and cannot yet fully automate complex workflows without intervention. However, the speaker acknowledges that AI will fundamentally change the landscape for SaaS companies, forcing them to adapt or risk becoming obsolete. The relationship between AI and SaaS is likened to that between Google and newspapers: a disruptive force that makes evolution necessary.

Notable Quote:

“Some of it feels like hype. On the other hand, it does seem to me that relationship between AI agents and SaaS companies is kind of similar to relationship of newspapers to Google in a sense that it will definitely change the playing field and companies will need to adapt or they will potentially die.” – Speaker.

Technical Terms

GraphQL API: A query language for APIs, used by the stargazer skill to efficiently retrieve GitHub data.
Token Usage: The number of tokens (units of text) a model reads and generates while working on a task. Model APIs are typically priced per token, so higher token usage generally translates to higher cost.
Agentic Workflow: A sequence of actions performed autonomously by an AI agent to achieve a specific goal.
Micromanagement: The practice of excessively controlling or directing the work of others, highlighting the current need for human oversight of AI agents.
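To make the GraphQL term concrete, the sketch below shows what a paginated stargazer query against GitHub's GraphQL API could look like, plus a parser for one response page. The internals of the stargazer skill are not described in the video, so the selected fields (login, twitterUsername, websiteUrl) and the parse_page helper are assumptions; the sample response is made-up data and no network call is made.

```python
# A paginated query for a repository's stargazers (GitHub GraphQL API).
# Assumption: these are the fields of interest; the actual skill may differ.
STARGAZERS_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    stargazers(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { login twitterUsername websiteUrl }
    }
  }
}
"""


def parse_page(response: dict):
    """Extract profile rows and the next cursor from one response page."""
    conn = response["data"]["repository"]["stargazers"]
    rows = [
        {
            "login": n["login"],
            "twitter": n.get("twitterUsername") or "",
            "website": n.get("websiteUrl") or "",
        }
        for n in conn["nodes"]
    ]
    info = conn["pageInfo"]
    # Return None when there are no further pages to fetch.
    next_cursor = info["endCursor"] if info["hasNextPage"] else None
    return rows, next_cursor


# Hypothetical response page for illustration only.
sample_page = {
    "data": {"repository": {"stargazers": {
        "pageInfo": {"hasNextPage": True, "endCursor": "abc"},
        "nodes": [{"login": "alice", "twitterUsername": "alice_dev", "websiteUrl": None}],
    }}}
}
rows, cursor = parse_page(sample_page)
print(rows, cursor)
```

A caller would send STARGAZERS_QUERY repeatedly, passing the returned cursor each time, until parse_page returns None for the cursor.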

Logical Connections

The video progresses logically from observing a market trend (the SaaS apocalypse), to investigating the underlying cause (advances in AI), to running a practical test of the claims. The test results are then used to draw conclusions about the current state of AI capabilities and the potential impact on the SaaS industry. The discussion of Desktop Commander provides a concrete example of how these technologies can be applied in practice.

Data & Statistics

$300 billion: Estimated stock market value lost by SaaS companies.
7% per month: Estimated improvement rate of Opus 4.6 in economic task performance (based on GDPval).
64.7% (GPT-5.3 Codex) vs. 72% (Humans): Performance on the OSWorld Benchmark.
72.7% (Opus 4.6): Performance on the OSWorld Benchmark, exceeding human performance.
4 links/minute (GPT-5.3 Codex) vs. 38 links/minute (Opus 4.6): Link extraction rate during the GitHub stargazer outreach test.
$0.29 (GPT-5.3 Codex) vs. $20 (Opus 4.6): Cost of running the GitHub stargazer outreach test.
7% error rate (Opus 4.6) vs. 0% (GPT-5.3 Codex): Error rate during the GitHub stargazer outreach test.

Synthesis/Conclusion

The video presents a nuanced perspective on the “SaaS apocalypse,” acknowledging the significant advances in AI but cautioning against excessive hype. While AI models are improving rapidly and show impressive capabilities, they still require human oversight and cannot yet fully automate complex tasks. Tools like Desktop Commander, which facilitate AI-assisted workflows, represent a promising path forward, allowing reusable skills to be created and computational resources to be allocated efficiently. The SaaS industry will undoubtedly be disrupted by AI, and adaptation and innovation will be key to survival.