@martheballon.bsky.social
Fig. 4: Accuracy declines as reasoning chains grow, but the drop is significantly smaller in more proficient models. o3-mini (m) reasons more effectively than o1-mini. o3-mini (h) gains accuracy over o3-mini (m), but uses more reasoning tokens across 𝗮𝗹𝗹 problems.
Fig. 3: o1-mini and o3-mini (m) have similar token distributions. Higher-performing models have a better ratio of correct to incorrect answers, even in high-token regions.
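A minimal sketch of the analysis behind Figs. 3–4: bin answers by reasoning-token count and compute per-bin accuracy for each model (per-bin accuracy is a monotone transform of the correct-to-incorrect ratio). The record schema here (`model`, `reasoning_tokens`, `correct`) is hypothetical, not the paper's actual data format.

```python
from collections import defaultdict

def accuracy_by_token_bin(records, bin_width=1000):
    """Map each model to {bin_start: accuracy} over reasoning-token bins."""
    hits = defaultdict(lambda: defaultdict(int))    # correct answers per bin
    totals = defaultdict(lambda: defaultdict(int))  # all answers per bin
    for r in records:
        b = (r["reasoning_tokens"] // bin_width) * bin_width
        totals[r["model"]][b] += 1
        hits[r["model"]][b] += int(r["correct"])
    return {
        model: {b: hits[model][b] / n for b, n in sorted(bins.items())}
        for model, bins in totals.items()
    }

# Toy usage: a flatter curve across bins means a smaller accuracy drop (Fig. 4);
# a higher value in the top bins means a better correct/incorrect ratio (Fig. 3).
records = [
    {"model": "o1-mini", "reasoning_tokens": 800, "correct": True},
    {"model": "o1-mini", "reasoning_tokens": 5200, "correct": False},
    {"model": "o3-mini (m)", "reasoning_tokens": 5100, "correct": True},
]
print(accuracy_by_token_bin(records))
```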
Fig. 2: Reasoning models allocate more reasoning tokens to disciplines that involve complex combinatorial reasoning. On average, token usage scales with problem complexity.
Fig. 1: gpt-4o lags behind the reasoning models o1-mini and o3-mini on the Omni-MATH benchmark. o3-mini (m) and o3-mini (h) surpass 50% accuracy in all math disciplines.
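A similar sketch for the per-discipline breakdowns in Figs. 1–2: group graded answers by math discipline and report accuracy alongside mean reasoning-token usage. The `discipline` field is assumed to come from Omni-MATH problem metadata; the grading step itself is omitted.

```python
from collections import defaultdict
from statistics import mean

def by_discipline(records):
    """Map discipline -> (accuracy, mean reasoning tokens) for one model."""
    groups = defaultdict(list)
    for r in records:
        groups[r["discipline"]].append(r)
    return {
        d: (mean(int(r["correct"]) for r in rs),      # Fig. 1: accuracy
            mean(r["reasoning_tokens"] for r in rs))  # Fig. 2: token usage
        for d, rs in groups.items()
    }
```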
LLMs are getting really good at reasoning, but the mechanisms behind it are poorly understood. In our recent paper, we investigated SOTA models and found that 'Thinking harder ≠ thinking longer'!

Joint work with @andresalgaba.bsky.social @vincentginis.bsky.social

Insights from our research (a thread):
February 24, 2025 at 4:02 PM
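For anyone who wants to collect this kind of data themselves, a hedged sketch using the OpenAI Python SDK's reasoning-token accounting (the `completion_tokens_details.reasoning_tokens` usage field). Field availability varies by SDK version and model, and this is not necessarily the paper's pipeline.

```python
# Hypothetical data-collection step: ask a reasoning model one benchmark problem
# and record how many reasoning tokens it spent. Not the paper's actual code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def solve_and_count(problem: str, model: str = "o3-mini"):
    """Return (answer_text, reasoning_token_count) for one problem."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": problem}],
    )
    details = resp.usage.completion_tokens_details
    return resp.choices[0].message.content, details.reasoning_tokens
```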