Apple’s latest research paper, The Illusion of Thinking, published last week, arrives with a deceptively humble premise: to analyse the reasoning capabilities of frontier LLMs in controlled puzzle environments. Despite the modest framing, the paper marks an ambitious intervention. It sets out to expose the cognitive fault lines of “thinking” models: those that pad their outputs with step-by-step chains-of-thought and self-reflection routines.
And yet, despite its rigor, the paper ends up revealing something deeper than it claims: beneath the scaling collapse it documents lies the superficiality of what we continue to call reasoning in language models.
Controlled Complexity
First, credit where it’s due. The study wisely abandons math benchmarks that have long been riddled with contamination and memorization loopholes. Instead, it opts for algorithmic puzzles like Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping. These environments are adjustable in complexity, synthetically clean and logically constrained. They simulate reasoning not as pattern retrieval, but as planning under rule-based transformation.
Let’s quickly walk through the puzzle environments; each stresses a different aspect of structured reasoning:
- Tower of Hanoi tests recursive strategy and state-tracking. The challenge grows exponentially with disk count, requiring the model to simulate abstract planning and preserve order across subgoals.
- Blocks World is a classic task from AI planning literature. Models must rearrange blocks by manipulating only top elements, respecting stack constraints. It demands sequencing, temporary placement logic, and dependency resolution.
- River Crossing requires transporting actor-agent pairs across a river while enforcing safety constraints (e.g., actors cannot be left with unaccompanied agents). It simulates constraint satisfaction and group dynamics where violating a rule breaks the entire sequence.
- Checker Jumping forces linear transformations of tokens with directional, colour-based constraints. It combines permutation logic and move optimization, framed within a strict movement rule set.
Together, these puzzles represent a targeted attempt to strip away ambiguity and evaluate reasoning in isolation from linguistic or factual context. The point is that the models are not being asked to know things; they are being asked to plan.
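To make the setup concrete, here is a minimal sketch of what such a rule-constrained, complexity-adjustable environment looks like, using Tower of Hanoi as the example. This is not Apple’s evaluation harness, just an illustration of the idea: complexity is a single dial (the number of disks), the rules are fully explicit, and scoring a model’s answer reduces to replaying its proposed moves against those rules.

```python
# Minimal sketch of a rule-constrained puzzle environment, using Tower of Hanoi.
# Complexity is one dial (n_disks); the only "knowledge" required is the rule
# set, so success is a matter of planning, not recall.

def initial_state(n_disks):
    """Three pegs; all disks stacked largest-to-smallest on peg 0."""
    return [list(range(n_disks, 0, -1)), [], []]

def is_legal(state, move):
    """A move (src, dst) is legal if src has a disk and that disk is smaller
    than the disk currently on top of dst (or dst is empty)."""
    src, dst = move
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def apply_move(state, move):
    """Return a new state with the (assumed legal) move applied."""
    src, dst = move
    nxt = [peg[:] for peg in state]
    nxt[dst].append(nxt[src].pop())
    return nxt

def is_solved(state, n_disks):
    """Solved when every disk sits on the last peg, largest at the bottom."""
    return state[2] == list(range(n_disks, 0, -1))

# Evaluating an answer is just replaying the model's move list against the rules.
state = initial_state(3)
for mv in [(0, 2), (0, 1), (2, 1)]:
    assert is_legal(state, mv), f"illegal move: {mv}"
    state = apply_move(state, mv)
print(state)  # [[3], [2, 1], []]
```

Because the minimum number of moves doubles (roughly) with every extra disk, the difficulty knob is clean and free of training contamination, which is exactly the property the math benchmarks lack.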
By focusing on process instead of just output, the authors gain access to an underexamined dimension: the reasoning trace. They are not only asking whether the model gets the right answer; they ask when the model thinks it has found it, how many tokens it spends, and what happens to its thinking pattern as difficulty increases. What emerges is a taxonomy of failure.
Three Regimes, One Architecture
The paper identifies three distinct performance regimes across reasoning models:
- Low complexity regime: In simpler puzzles, standard language models (those that output only the final answer, without an elaborate chain-of-thought) actually outperform their “thinking” counterparts. They do this with fewer tokens and higher accuracy. This suggests that reasoning traces, in these scenarios, introduce noise rather than clarity.
- Medium complexity regime: As tasks become moderately more complex (longer sequences, more decision branches), thinking models begin to outperform. Their step-by-step structure provides scaffolding that helps them stay on track and self-correct. This is the only zone where “thinking” appears genuinely helpful.
- High complexity regime: Beyond a certain threshold, both standard and thinking models collapse. Accuracy drops to zero, regardless of reasoning format. But what’s more alarming is that thinking effort (measured in tokens spent) also begins to decline. The models start thinking less even as the task becomes harder.
These regimes persist across different puzzle environments and different model families (Claude, DeepSeek, OpenAI). Importantly, the architecture does not change; only the output format does. The “thinking” models are not a new class of machine. They are scaffolding variants.
This makes the collapse more interesting. The models are not running out of compute. They are running out of coherence.
Effort Declines Before the Cliff
One of the most striking and underappreciated findings of the paper is visualized in the middle panel of its Figure 1. As complexity increases, models initially allocate more tokens to their reasoning. But past a threshold, they begin to think less, not more. And not because they are constrained: token limits are not reached.

This should concern anyone who believes chain-of-thought scaling is the path to general intelligence. These models are not collapsing under pressure, but withdrawing effort while still capable of continuing.
Whether this is due to degraded internal state representations, recursive incoherence, or emergent local minima is beyond the scope here. What matters is that the illusion breaks down, just as the paper’s title suggests. “Thinking,” in these models, is not learning; it is a verbosity protocol.
The Myth of More Tokens
The paper extends its analysis from puzzles to traditional math benchmarks: MATH500, AIME24, and AIME25. Here it uses pass@k as the metric: essentially, whether the correct answer appears in any of the top k sampled completions within a token budget.
This is a standard evaluation method for probabilistic generation tasks like math or code. The idea is that if a model can eventually guess the correct path across multiple attempts, it may possess the latent structure to get there, even if it doesn’t do so consistently in a single shot.
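For reference, the unbiased estimator that is standard for this metric (introduced in OpenAI’s Codex paper) fits in a few lines. This is a sketch of the metric itself, not the authors’ evaluation code: given n sampled completions of which c are correct, it estimates the probability that at least one of k samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations is correct, given c of the n are correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 correct answers out of 100 samples:
print(round(pass_at_k(n=100, c=5, k=1), 3))   # ~0.05
print(round(pass_at_k(n=100, c=5, k=10), 3))  # ~0.416
```

The point to notice is how forgiving the metric is: one lucky sample out of k counts as success, which is exactly why it says little about how the answer was reached.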
What the researchers find is instructive:
- On MATH500, a dataset heavily represented in model pretraining, both reasoning and non-reasoning models converge to similar pass@k rates when given the same compute. In other words, verbosity doesn’t help when the problem is already statistically familiar.
- On AIME24, thinking models start to gain an advantage. This could imply they are more robust to slightly higher compositional complexity, or simply less overfit to this data.
- On AIME25, however, performance drops for both. Notably, human scores were higher on AIME25 than AIME24, suggesting that it was not harder. The drop in model performance is more likely explained by reduced training contamination or memorization.

This creates interpretive ambiguity. Are thinking models truly better reasoners? Or are they just leveraging training exposure and compute differently?
The results are inconclusive because pass@k, like final answer accuracy, does not capture the how of reasoning. It’s just a brute-force metric. The model is either right or it isn’t, regardless of how structured or hollow its thinking path may have been.
What emerges is a core contradiction. If models converge in performance under equivalent compute, then reasoning traces may simply be a form of inference-time sampling, not a sign of deeper capability. And if performance diverges only on cleaner datasets, we must ask: what exactly is being tested? Reasoning or retention?
Overthinking as Degradation
The paper offers a detailed analysis of reasoning traces, showing a curious phenomenon: models often stumble upon the correct solution early, then discard it as they continue to “think.” This is what the literature calls overthinking, but the term is misleading. It implies excess effort. In reality, this is semantic decay. The model cannot retain or trust its early insight.
In simpler tasks, this means wasting compute. In harder tasks, it means the model loops through invalid paths, never returning to the earlier correct one. And crucially, as complexity rises, correct solutions tend to occur later and less frequently until they vanish altogether.
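The within-trace analysis implies bookkeeping along these lines: where in a trace does the first correct intermediate solution appear, and does the model keep it? The sketch below assumes the trace has already been parsed into an ordered list of candidate solutions and that a puzzle validator exists; both are hypothetical stand-ins, not the paper’s actual tooling.

```python
# Schematic of the bookkeeping behind the "overthinking" analysis: locate the
# first correct candidate in a reasoning trace and check whether the model
# keeps it. The candidates list and the is_correct validator are hypothetical
# stand-ins, not the paper's tooling.

def first_correct_position(candidates, is_correct):
    """Relative position (0.0 = start of trace, 1.0 = end) of the first correct
    candidate, or None if no correct candidate ever appears."""
    for i, cand in enumerate(candidates):
        if is_correct(cand):
            return i / max(len(candidates) - 1, 1)
    return None

def keeps_its_insight(candidates, is_correct):
    """True if the model's final candidate (its answer) is correct."""
    return bool(candidates) and is_correct(candidates[-1])

# Toy usage: the model finds the solution a third of the way in, then drifts.
trace = ["wrong", "right", "wrong", "wrong"]
checker = lambda c: c == "right"
print(first_correct_position(trace, checker))  # 0.333...
print(keeps_its_insight(trace, checker))       # False: the insight is discarded
```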
Algorithms Don’t Save You
One of the paper’s most interesting sections is the controlled-prompt experiment on Tower of Hanoi. The researchers provide the exact solution algorithm in the prompt; all the model has to do is execute it. Yet performance does not improve. If LLMs could reason symbolically, or even just copy steps reliably, this should have been trivial. Instead, they collapse at roughly the same point as when solving from scratch.
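For context, the algorithm in question is the textbook recursion below. This is a sketch of the kind of procedure handed to the model in-prompt, not the paper’s exact prompt text; executing it requires no insight at all, only faithful rule-following.

```python
# The standard recursive Tower of Hanoi solution: the kind of procedure the
# experiment places directly in the prompt (textbook version, not the paper's
# exact wording). Executing it is pure mechanical rule-following.

def hanoi(n, src=0, aux=1, dst=2):
    """Yield the (source_peg, target_peg) moves that solve n disks."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)  # park the top n-1 disks on aux
    yield (src, dst)                         # move the largest disk to dst
    yield from hanoi(n - 1, aux, src, dst)  # stack the n-1 disks on top of it

moves = list(hanoi(4))
print(len(moves))  # 15, i.e. 2**4 - 1: the move count grows exponentially in n
```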
Any argument that future frontier models will “just learn to code” their way through logic puzzles, in my opinion, should contend with this: you can hand them the code, and they’ll still fail to run it.
This also undermines the fantasy that we’re building emergent cognitive agents. These are autoregressive writers, not dynamic planners for your business or your daily life. The fact that their collapse curves look smooth only hides how brittle the mechanism is.
Failure as Performance
What The Illusion of Thinking ultimately reveals is that we’re still confusing output performance with internal capability. These models produce verbose chains-of-thought not because they’re simulating cognition, but because they’ve been trained to externalize patterns that look like it.
When those patterns grow too long, the coherence falls apart. Not due to hardware or budget, but because there is no internal model of the task. No working memory, no persistent state, no agenda. It doesn’t matter how many tokens they’re allowed. They are not thinking for longer; they are just repeating for longer, or getting stuck in a cycle.
As we argued in Safety Alignment vs. Jailbreaking: From Ethical LLMs Like ChatGPT to the Rise of Dark Models, performative complexity itself, whether via safety refusals or aesthetic generation, is not evidence of deeper cognitive function. Apple’s study reaffirms this: verbosity does not equal reasoning, and more compute does not equal capability.
Towards Real Evaluation
To Apple’s credit, this paper is not selling the dream. It exposes failure modes, characterizes them with precision and resists speculative claims. But it also operates within the same assumption that has plagued this field: that “thinking” is something these models can do if we tune the scaffolding just right.
What it misses is the broader context: reasoning is not about tracing paths through token space. It is about holding concepts, transforming them, and choosing goals. None of these things are visible in puzzle traces or math outputs. What we need is not just more controlled environments. We need evaluation paradigms that measure abstraction, semantic conflict resolution and transfer of principles across domains. Until then, reasoning remains just another formatting trick.
Final Thoughts: Recognizing the Simulation
The paper calls these models Large Reasoning Models. But what it documents is not reasoning; it is a language game about planning. And like all good simulations, it performs well until the fidelity breaks. Then the cracks show exactly what was there.
The illusion, it turns out, is not just in the models. It’s in our desire to believe they’re doing the labour of thinking for us.