Research by Apple on Large Reasoning Models (LRMs) demonstrates that reasoning capability collapses sharply as puzzle complexity rises. The study employed controllable puzzles, including the Tower of Hanoi, to evaluate LRM performance across three complexity regimes. On simple puzzles, reasoning and non-reasoning models perform similarly; at medium complexity, reasoning models pull ahead; at high complexity, all models collapse to zero accuracy. These findings indicate substantial limitations in LRM reasoning and challenge existing beliefs about their capabilities.
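To make the complexity scaling concrete, here is a minimal Python sketch (an illustration, not the study's actual evaluation harness) that generates optimal Tower of Hanoi solutions. The optimal move count grows as 2^n − 1, so each additional disk roughly doubles the length of the move sequence a model must produce without a single error:

```python
# Illustrative sketch, not the paper's harness: puzzle "complexity" can be
# dialed up by disk count, since optimal solution length grows as 2**n - 1.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # move the top n-1 disks out of the way
        + [(src, dst)]                       # move the largest disk to its target
        + hanoi_moves(n - 1, aux, src, dst)  # restack the n-1 disks on top of it
    )

for n in range(1, 11):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1  # each added disk roughly doubles the workload
    print(f"{n} disks -> {len(moves)} moves")
```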
Apple researchers investigated the abilities of LRMs on controllable puzzles, revealing a critical collapse threshold beyond which their reasoning fails to scale.
As puzzle complexity increases, reasoning models hold an early advantage over standard LLMs, but both eventually collapse to zero accuracy at high complexity levels.
The study reveals that despite sophisticated self-reflective mechanisms, current LRMs struggle with generalizable reasoning beyond certain complexity thresholds, challenging assumptions about their capabilities.
LRMs like o3 and DeepSeek-R1, designed to generate step-by-step reasoning before answering, initially perform better than standard LLMs but encounter fundamental limitations at higher complexity.
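For illustration, here is a hedged sketch of how such step-by-step outputs could be scored: a simulator applies each proposed move to the puzzle state and accepts only fully legal, complete solutions. The function name and scoring scheme below are assumptions for this example, not the study's published code:

```python
# Hypothetical scoring sketch: simulate a model's emitted move list against
# the Tower of Hanoi rules. Names and scoring here are assumptions, not the
# paper's actual code.

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    """Simulate moves on pegs A/B/C; True iff all n disks end on peg C legally."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top disk sizes
    for src, dst in moves:
        if src not in pegs or dst not in pegs or not pegs[src]:
            return False  # unknown peg, or moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # solved: full tower on C

# A model's output either passes the simulator or it does not; accuracy at a
# given disk count is the pass rate, which is how a collapse to zero shows up.
print(is_valid_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
print(is_valid_solution(2, [("A", "C"), ("A", "C")]))              # False
```

This pass/fail framing also explains why the collapse is so abrupt: as the required move sequence doubles with each disk, even a small per-move error rate drives the probability of a fully correct solution toward zero.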