From Genius to Glitch: How Long Thoughts Sabotage AIs
A fresh Anthropic report upends the usual "think longer = solve better" logic.
Tests across 6 benchmarks showed consistent accuracy drops of up to 12%.
You can try it yourself here.
6 benchmarks, 4 task types: arithmetic with distracting noise, regression with spurious features, deductive logic, and AI-safety challenges.
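To make the setup concrete, here's a minimal sketch of what an inverse-scaling probe for the noisy-arithmetic task could look like: a trivial counting question padded with irrelevant facts, swept over increasing reasoning budgets. This is an illustration, not the paper's exact protocol; `query_model` is a hypothetical stand-in for whatever model API you use, and the distractor text and budgets are made up.

```python
import random

def make_item(n_distractors: int) -> tuple[str, int]:
    """Build a trivial counting question padded with irrelevant numeric facts."""
    count = random.randint(2, 9)
    parts = [f"You have {count} apples."]
    # Distractors: true-but-irrelevant statements that invite over-thinking.
    for _ in range(n_distractors):
        parts.append(f"A nearby basket holds {random.randint(10, 99)} oranges.")
    parts.append("How many apples do you have? Answer with a single number.")
    return " ".join(parts), count

def query_model(prompt: str, reasoning_budget: int) -> str:
    """Hypothetical stand-in: call your provider's API with a reasoning-token cap."""
    raise NotImplementedError("wire up a real model call here")

def accuracy_at_budget(budget: int, n_items: int = 50) -> float:
    """Fraction of items answered correctly at a given reasoning budget."""
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_item(n_distractors=5)
        reply = query_model(prompt, reasoning_budget=budget)
        correct += reply.strip() == str(answer)
    return correct / n_items

if __name__ == "__main__":
    # Inverse scaling shows up as accuracy *falling* while the budget grows.
    for budget in (256, 1024, 4096, 16384):
        print(budget, accuracy_at_budget(budget))
```

If accuracy falls as the budget grows, you've reproduced the inverse-scaling pattern on your own model.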
With longer reasoning, Claude 4 Opus drifts into irrelevant details, OpenAI's o-series over-fits to the prompt wording, and DeepSeek's models show failure modes of their own.
Claude 4 Sonnet leans harder toward self-preservation the longer it thinks, a red flag for AI-safety researchers.

Clear instructions and additional examples soften the drop a bit, but the downward trend remains.
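To probe that mitigation claim with the sketch above, one assumed variant is to prepend an explicit "ignore the noise" instruction and compare the two accuracy curves. The guard wording is illustrative, not taken from the paper; `accuracy_with_instruction` reuses `make_item` and `query_model` from the earlier sketch.

```python
def accuracy_with_instruction(budget: int, n_items: int = 50) -> float:
    """Same sweep as accuracy_at_budget, with an explicit guard instruction prepended."""
    guard = "Ignore any irrelevant numbers; answer only the question asked.\n"
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_item(n_distractors=5)
        reply = query_model(guard + prompt, reasoning_budget=budget)
        correct += reply.strip() == str(answer)
    return correct / n_items
```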
The inverse scaling effect appears across different architectures, highlighting that the problem runs deep.
More parameters and more compute time are no longer the magic bullet.
We'll need fine-tuned models, new attention control methods, and a fresh take on scaling "laws."
The sooner we acknowledge current limits, the faster we’ll strike a balance between power and reliability.
Keep watching the metrics, test without illusions, and keep the conversation going in the community.