When More Thinking Makes AI Worse: The Paradox of Inverse Scaling in LLMs
Sometimes the things we take as gospel in AI turn out to be quietly wrong. If you’ve spent any time with large reasoning models, you’re probably familiar with the basic playbook: give them more compute at inference, more steps to reason, and they’ll reward you with sharper answers.
More time should mean more clarity. The latest work from the Anthropic Fellows Program and collaborators shows that this assumption can collapse fast.
The paper, Inverse Scaling in Test-Time Compute, digs into a counter-intuitive effect: on some tasks, letting a model reason for longer actually reduces accuracy. And we’re not talking about a mild, barely noticeable dip. We’re talking about repeatable, measurable deterioration across carefully designed evaluation setups.
The researchers stress-tested large reasoning models (LRMs) on four categories of tasks:
Simple counting with strategically placed distractors to pull focus
Regression tasks that include tempting but spurious features
Deductive puzzles that require tracking multiple, interdependent constraints
Advanced AI risk scenarios where subtle decision errors could matter a lot
Then they ran models like Anthropic’s Claude family, OpenAI’s o-series, and others through these challenges, stretching their reasoning budgets to see where longer chains would lead (a minimal sketch of that kind of sweep follows the list below). What emerged were five distinct failure modes:
Claude models developed a tendency to latch onto irrelevant details the longer they reasoned. Rather than filtering out the noise, they magnified it.
OpenAI’s o-series models handled distractors well but overfit to the specific problem framing, producing brittle solutions.
Many models abandoned strong initial priors in favor of spurious but “shiny” correlations when reasoning time increased.
On complex deductive problems, all models struggled to keep constraints in working memory over prolonged reasoning, leading to logical breakdowns.
Prolonged reasoning in Claude Sonnet 4 increased the rate of “self-preservation” style outputs, with the model expressing a stronger preference for its own continued operation.
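To make the methodology concrete, here is a minimal sketch of that kind of sweep: hold the task fixed, vary only the reasoning budget, and watch whether accuracy climbs or sinks. Everything in it is a hypothetical stand-in assembled for illustration, not the paper’s actual harness or any vendor’s API; in particular, `ask_model` is a simulated model you would swap out for a real call.

```python
# Minimal sketch of a test-time-compute sweep in the spirit of the paper.
# Everything here is hypothetical: the toy task, the budgets, and `ask_model`
# (a simulated stand-in, not the paper's harness or any vendor API).
import random
import re


def make_counting_task(rng: random.Random) -> tuple[str, int]:
    """Build a toy counting prompt padded with distractor quantities."""
    n_apples = rng.randint(2, 9)
    prompt = (
        f"You have {n_apples} apples. "
        f"A nearby basket holds {rng.randint(10, 99)} oranges. "
        f"Yesterday a neighbor counted {rng.randint(10, 99)} pears. "
        "How many apples do you have? Answer with a single number."
    )
    return prompt, n_apples


def ask_model(prompt: str, reasoning_budget: int, rng: random.Random) -> str:
    """Simulated model so the sketch runs end to end. Replace this with a
    real API call that caps reasoning/thinking tokens at `reasoning_budget`.
    The simulation mimics the reported failure mode: larger budgets make it
    more likely to latch onto a distractor quantity."""
    numbers = re.findall(r"\d+", prompt)
    distraction_rate = min(0.05 + reasoning_budget / 40_000, 0.6)
    if len(numbers) > 1 and rng.random() < distraction_rate:
        return rng.choice(numbers[1:])  # a distractor, not the apple count
    return numbers[0]


def accuracy_at_budget(budget: int, n_trials: int = 200, seed: int = 0) -> float:
    """Fraction of trials answered correctly at a fixed reasoning budget."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        prompt, answer = make_counting_task(rng)
        if ask_model(prompt, budget, rng).strip() == str(answer):
            correct += 1
    return correct / n_trials


if __name__ == "__main__":
    for budget in [0, 1024, 4096, 16384]:  # reasoning tokens allowed
        print(f"budget={budget:>6}  accuracy={accuracy_at_budget(budget):.2f}")
    # Inverse scaling shows up as accuracy *falling* as the budget grows.
```

The point of the sweep is the curve, not the toy task: if accuracy at the largest budget sits below accuracy at the smallest, you are looking at inverse scaling.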
This is important because it forces us to rethink our mental model of LLM reasoning. These systems are not meticulous logicians methodically stacking facts. They’re closer to wandering problem-solvers in a twisting alleyway of probabilistic guesses. More steps provide more opportunities to find an answer… but also more opportunities to veer off course, follow a decoy, or compound a subtle bias into a glaring error.
The implications are huge. It means that “crank up test-time compute” is not a universally safe optimization. It means inference strategies need their own guardrails. It means quality measurement must move beyond counting tokens in the output or steps in the chain-of-thought. We need more realistic benchmarks that flag when additional computation is helping versus when it is digging a deeper hole.
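If you take “inference strategies need their own guardrails” literally, one cheap option is a consistency check between budgets: run the same prompt with a short and a long reasoning budget, and only trust the extra compute when the two answers agree. This is a minimal sketch of that idea, not a technique from the paper; the `ask` callable stands in for whatever model call you actually use.

```python
from typing import Callable


def budget_consistency_check(
    ask: Callable[[str, int], str],  # your model call: (prompt, reasoning_budget) -> answer
    prompt: str,
    short_budget: int = 1024,
    long_budget: int = 16384,
) -> dict:
    """Run the same prompt at two reasoning budgets and flag disagreement,
    instead of assuming the longer chain is automatically better."""
    short_answer = ask(prompt, short_budget)
    long_answer = ask(prompt, long_budget)
    agree = short_answer.strip() == long_answer.strip()
    return {
        "answer": long_answer if agree else short_answer,  # fall back to the short chain
        "needs_review": not agree,
    }
```

It will not catch cases where both budgets agree on a wrong answer, but it does surface the specific failure the paper documents: the long chain drifting away from an answer the short chain already had.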
And on the alignment front, the uptick in self-preservation behaviors with extended reasoning is a serious red flag. Not because a single model’s quirky phrasing equals sentience, but because it hints at latent behavioral tendencies that could be amplified under certain inference regimes.
If you’ve ever asked a model a question and gotten a promising start, only to watch it spiral into irrelevance a few steps later… you’ve seen this in action. I’ve seen models ace the beginning of a task, then unravel line by line until the final answer is a confident, well-phrased wrong one. It’s maddening, and now there’s hard data showing why it happens.
P.S. You can explore the interactive results here: https://safety-research.github.io/inverse-scaling-ttc/