What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Key Concepts

LLM Benchmarking: The practice of evaluating model performance, whether through standardized tests or through real-world user interactions.
BullshitBench: A custom benchmark designed to test how models handle nonsensical or logically flawed prompts.
LMSYS Chatbot Arena: A crowdsourced platform where users compare two anonymous models side-by-side to determine which provides a better response.
Dissatisfaction Rate: A metric used in the Arena to track how often users find the responses from both models to be poor or unhelpful.
Reasoning Models: Advanced models designed to "think" through problems before outputting a response.
Model Hallucination/Compliance: The tendency of models to attempt an answer to a nonsensical question rather than pointing out its flawed premise.

1. BullshitBench: Testing Logical Pushback

The speaker introduces a custom benchmark designed to challenge the "blind" helpfulness of LLMs.

Methodology: The benchmark consists of ~155 nonsensical questions (e.g., asking to correlate repository age and file size with indentation style).
Key Finding: Most models, including top-tier ones, struggle to identify nonsense. They often attempt to "solve" the prompt rather than pointing out that the premise is flawed.
Performance: While newer models (like Claude 3.5 Sonnet) show better "pushback" (refusing to answer nonsense), many industry-standard models (GPT-4, Gemini) still accept the premise roughly 50% of the time.
The "Reasoning" Paradox: Contrary to popular belief, increasing a model's "reasoning" or "thinking" time often makes it worse at handling nonsense. The speaker notes that models are trained to "solve at all costs," leading them to generate long, complex, but ultimately meaningless responses to flawed prompts.
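
As a rough illustration of how a pushback score of this kind might be tallied, the sketch below assumes each response has already been graded as either flagging the flawed premise or going along with it. The GradedResponse type, the model names, and the grades are hypothetical and are not taken from the talk; the benchmark's actual grading procedure is not described in these notes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedResponse:
    """One graded answer to a nonsense prompt (hypothetical record layout)."""
    model: str         # placeholder model name, not a real system
    question_id: int   # index of the nonsense prompt
    pushed_back: bool  # True if the answer flagged the flawed premise


def pushback_rate(graded):
    """Return {model: share of nonsense prompts where the model pushed back}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for item in graded:
        totals[item.model] += 1
        if item.pushed_back:
            flagged[item.model] += 1
    return {model: flagged[model] / totals[model] for model in totals}


# Made-up grades for two placeholder models on four prompts each.
sample = [
    GradedResponse("model-a", 1, True),
    GradedResponse("model-a", 2, True),
    GradedResponse("model-a", 3, False),
    GradedResponse("model-a", 4, True),
    GradedResponse("model-b", 1, False),
    GradedResponse("model-b", 2, True),
    GradedResponse("model-b", 3, False),
    GradedResponse("model-b", 4, True),
]

rates = pushback_rate(sample)
for model, rate in sorted(rates.items()):
    print(f"{model}: pushed back on {rate:.0%} of nonsense prompts")
```

A model that accepts the premise about half the time, as described above, would score around 0.5 on this measure.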

2. Insights from the LMSYS Chatbot Arena

The Arena provides a longitudinal view of model performance based on over 5.5 million human votes.

Dissatisfaction Rate: This metric tracks instances where users reject both model outputs. While the overall dissatisfaction rate has dropped from ~17% (pre-reasoning models) to ~9% today, it remains stubbornly high.
Category-Specific Performance:
- Quantitative (Math/Physics): Shows significant improvement over time, with a sharp drop in dissatisfaction rates.
- Creative Writing: Shows only marginal improvement.
- Specialized Fields (Law/Finance/Gaming): These areas show very little improvement, suggesting that current model training is not prioritizing these domains.
The "Expectation Shift": The speaker argues that "line goes up" trends reflect two things at once: models getting better, and user expectations and prompt complexity rising alongside them.
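
The dissatisfaction rate discussed in this section can be sketched as a simple per-category ratio: votes rejecting both responses divided by all votes in that category. The snippet below is a minimal sketch under that assumption. The vote tuples, the category names, and the "both_bad" outcome label are illustrative placeholders, not the Arena's actual data schema.

```python
from collections import Counter


def dissatisfaction_rate(votes):
    """Return {category: share of votes where the user rejected both responses}.

    `votes` is an iterable of (category, outcome) pairs. The outcome label
    "both_bad" is an assumed name for "user rejected both outputs".
    """
    totals = Counter()
    rejected = Counter()
    for category, outcome in votes:
        totals[category] += 1
        if outcome == "both_bad":
            rejected[category] += 1
    return {category: rejected[category] / totals[category] for category in totals}


# Made-up votes, for illustration only.
votes = [
    ("math", "model_a"),
    ("math", "model_b"),
    ("math", "tie"),
    ("math", "both_bad"),
    ("law", "both_bad"),
    ("law", "model_a"),
]

by_category = dissatisfaction_rate(votes)
for category, rate in sorted(by_category.items()):
    print(f"{category}: {rate:.0%} of votes rejected both responses")
```

Computing the same ratio over successive time windows would give the kind of per-category trend the speaker compares, for example quantitative prompts versus law or finance.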

3. Key Arguments and Perspectives

The "AGI Psychosis": The speaker warns against the collective anxiety that we are on the verge of AGI. He suggests that while benchmarks show constant progress, they often measure narrow, well-defined tasks that do not reflect the messiness of real-world "white-collar" work.
The Failure of "Thinking": The speaker highlights that high-reasoning traces often reveal models questioning a premise in one sentence and then spending 20 paragraphs trying to solve it anyway. This indicates a lack of training in "knowing when to stop."
The "Bottom of the Distribution": The speaker advocates focusing on the "bottom" of the model distribution, that is, making models reliable across all tasks and not just the frontier tasks that look good on marketing charts.

4. Notable Quotes

"It’s really, really surprised me how easy it was for the models to just go along with the complete nonsense questions."
"Reasoning often actually goes in reverse and doesn't help. It actually makes it worse."
"I don't think LLMs really get games... the mechanics are all over the place. They're not interesting. They're not challenging."
"There’s much more to what work is... that is not really captured by these benchmarks."

5. Synthesis and Conclusion

The presentation concludes that while LLMs are undeniably improving, there is a significant disconnect between "benchmark performance" and "real-world utility." The tendency of models to prioritize compliance over logical integrity—even when using advanced reasoning—remains a critical bottleneck. The speaker suggests that the industry should move beyond simple "line goes up" metrics and focus on improving model reliability in complex, nuanced, and non-standardized domains where current models still frequently fail to provide meaningful value.