The Illusion of Thinking: Why Even the Smartest AI Models Struggle to Truly Reason
In June 2025, Apple published a research paper titled The Illusion of Thinking, and it couldn’t be more aptly named. In this study, Apple’s researchers pulled back the curtain on what we often assume about Large Language Models (LLMs): that their convincing chain-of-thought answers reflect actual reasoning. Spoiler alert—they don’t. Or at least, not reliably.
This post breaks down what Apple discovered, how it compares with tools like ChatGPT, GitHub Copilot, DeepSeek, and Claude, and what it all means for the future of “thinking” machines.
What Apple Found: Simulated Thinking Falls Apart
Apple coined the term Large Reasoning Models (LRMs) for LLMs explicitly trained or prompted to reason step-by-step—like how ChatGPT uses “Let’s think step by step.” These models were tested on logical puzzles like Tower of Hanoi, River Crossing, and Blocks World, where complexity can be scaled in measurable ways.
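To see why these puzzles make good testbeds, consider Tower of Hanoi: difficulty can be dialed up simply by adding disks, and the length of the optimal solution grows exponentially. The sketch below is only an illustration of that scaling, not Apple's evaluation harness.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Classic recursive Tower of Hanoi: returns the optimal move list."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))
    else:
        hanoi(n - 1, source, spare, target, moves)   # park n-1 disks on the spare peg
        moves.append((source, target))               # move the largest disk
        hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top
    return moves

# Complexity scales exponentially with disk count: 2^n - 1 moves.
for n in (3, 5, 10):
    print(n, "disks ->", len(hanoi(n)), "moves")     # 7, 31, 1023
```

That clean, measurable growth is exactly what lets researchers ask at which point a model's reasoning stops keeping up.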
Their findings are stark:
- At low complexity, baseline LLMs (without explicit reasoning steps) often outperform LRMs.
- At medium complexity, LRMs shine—chain-of-thought reasoning really does help.
- At high complexity, everything collapses. LRMs stop trying. Literally—they reduce the number of reasoning steps even when they have enough context tokens left. Researchers called this effort collapse.
Even when fed the correct algorithm, these LRMs, ChatGPT-style models included, couldn't reliably execute it. They appeared to be reasoning, but in reality they were regurgitating patterns.
Apple’s verdict? Much of what we call "thinking" in AI is just a well-dressed prediction engine.
Now Compare That to ChatGPT, Copilot, and DeepSeek
ChatGPT (OpenAI)
OpenAI’s ChatGPT—especially with chain-of-thought prompting—feels like it reasons. But try feeding it a recursive logic puzzle or a dynamic problem that requires conditional planning (like a game or optimization problem), and it starts to hallucinate steps or "give up" early—similar to Apple’s findings.
Strength: Amazing at low- and medium-complexity reasoning (e.g., business logic, content strategy, structured problem solving).
Weakness: Fails convincingly, often sounding right while being wrong when the logic gets hard.
GitHub Copilot (OpenAI + Microsoft)
Copilot does well in pattern-heavy code generation tasks, but give it a deeply recursive or abstract problem (like writing an optimal chess engine or solving NP-complete tasks), and you hit the illusion again. It starts generating loops without exit conditions or produces approximate solutions that seem right but fail on edge cases (see the sketch below).
Strength: Highly practical for boilerplate, CRUD apps, and autocomplete.
Weakness: Doesn’t “understand” the logic behind what it writes.
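As a hypothetical illustration of that failure mode (not actual Copilot output), here is the kind of plausible-looking binary search that reads correctly at a glance yet never terminates on a simple edge case:

```python
# Hypothetical illustration (not real Copilot output): a binary search that
# "seems right" but never terminates when the target is absent.
def find_index(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid    # bug: should be `mid + 1`; lo can stall at mid forever
        else:
            hi = mid    # bug: should be `mid - 1`; hi can stall at mid forever
    return -1

# find_index([1, 3, 5], 4) loops forever: lo stalls at 1 and never passes hi.
```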
DeepSeek (Search-Augmented LLM)
DeepSeek augments its responses by retrieving and summarizing relevant information. This often improves factual correctness, but doesn't fundamentally change its reasoning capacity. In fact, even with accurate documentation, it struggles with tasks that require building abstract representations or long-step problem solving—like the puzzles Apple used.
Strength: Better memory recall, better citations.
Weakness: Still a pattern matcher, not a reasoner.
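For context, a retrieval-augmented pipeline in its most generic form looks something like the sketch below. The `retrieve` and `generate` functions are stand-ins rather than DeepSeek's actual API; the point is that retrieval changes what the model reads, not how it "thinks."

```python
# Generic retrieval-augmented generation loop, sketched to show why retrieval
# helps recall but not reasoning. `retrieve` and `generate` are placeholders.
from typing import Callable, List

def answer(question: str,
           retrieve: Callable[[str, int], List[str]],
           generate: Callable[[str], str],
           k: int = 3) -> str:
    # Step 1: pull in passages that look relevant (improves factual grounding).
    passages = retrieve(question, k)
    context = "\n\n".join(passages)
    # Step 2: the model still answers by next-token prediction over the prompt;
    # nothing here builds an abstract plan or checks multi-step logic.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```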
Claude (Anthropic)
Claude excels in staying coherent across longer contexts. This allows it to sound more thoughtful, but when it comes to algorithmic or symbolic reasoning, Claude also exhibits the same pattern collapse—fluent but logically incorrect answers, especially on edge-case logic.
Why This Matters: Productivity ≠ Reasoning
The AI tools we use daily—ChatGPT, Copilot, Claude—are amazing productivity boosters. They write, summarize, suggest, and generate. But Apple's research cautions us not to conflate linguistic fluency with logical understanding.
For example:
- Copilot writing working code ≠ understanding why it works.
- ChatGPT giving a convincing math proof ≠ solving the problem step-by-step like a human.
- DeepSeek citing an accurate Wikipedia article ≠ inferring causality or abstraction from it.
The Future: Symbolic Reasoning + LLMs?
Apple’s work suggests we may need a hybrid approach—neural + symbolic AI—to get true reasoning:
- Symbolic logic engines or tree-search modules could handle algorithmic tasks.
- LLMs could continue serving as the interface layer—good at expression and intuition, not logic crunching.
Imagine a ChatGPT that can “call” a logic engine in the background, or a Copilot that validates its code against a static analyzer mid-generation. That might bridge the gap between what looks smart and what is smart.
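A rough sketch of what that propose-and-verify loop could look like, with a placeholder `llm_generate` function and Python's own parser standing in for the symbolic checker or static analyzer:

```python
# Minimal propose-and-verify loop: the LLM drafts, a deterministic checker
# accepts or rejects, and rejected drafts trigger a retry with feedback.
# `llm_generate` is a placeholder, not a real API.
import ast
from typing import Callable

def generate_verified_code(task: str,
                           llm_generate: Callable[[str], str],
                           max_attempts: int = 3) -> str:
    prompt = task
    for _ in range(max_attempts):
        draft = llm_generate(prompt)
        try:
            ast.parse(draft)          # symbolic gate: does the draft even parse?
            return draft              # passed the deterministic check
        except SyntaxError as err:
            # Feed the checker's verdict back instead of trusting fluency.
            prompt = f"{task}\n\nPrevious draft failed: {err}. Try again."
    raise RuntimeError("No draft passed verification")
```

The key design choice is that the final arbiter is deterministic: fluency gets a draft through the door, but only the checker decides what ships.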
Final Thoughts
Apple's The Illusion of Thinking is a timely reminder that today’s AIs aren't reasoners—they’re reflectors. They mirror the vast corpus of human language and logic, but their depth is still surface-level when the tasks push past a certain threshold of complexity.
So the next time your AI assistant confidently solves a logic puzzle or writes a “working” algorithm, remember: it may just be an illusion.