Michael Hahn
m-hahn.bsky.social
Prof at Saarland. NLP and machine learning.
Theory and interpretability of LLMs.
https://www.mhahn.info
8/8 Big thanks to my co-authors Alireza, Xinting, and Mark! 🔗 Read the paper: arXiv.org/abs/2502.02393
Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers
Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpad...
May 5, 2025 at 12:25 PM
7/8 Takeaway: there is no free lunch. Long CoTs are information-theoretically necessary for these tasks; compression tricks will hit these hard limits unless the model architecture itself is revamped.
6/8 Graph reachability: Given a DAG and two nodes, is there a path? You might try to “cheat” with clever CoTs—but in fact you need Ω(|E| log |V|) steps, matching BFS’s runtime (plus indexing overhead).
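For concreteness, here is a minimal BFS sketch of the reachability check the post compares against. The function name and edge counter are illustrative, not from the paper; the point is that BFS scans each edge once, so its step count is O(|E|), and the paper's CoT bound adds a log|V| factor for writing node indices.

```python
from collections import deque

def reachable(adj, s, t):
    """BFS on a DAG given as adjacency lists.

    Returns (found, edges_scanned). Each edge is scanned at most once,
    so the work is O(|E|) steps; writing node indices in a CoT costs an
    extra log|V| factor per step, matching the Omega(|E| log |V|) bound.
    """
    seen, frontier, edges_scanned = {s}, deque([s]), 0
    while frontier:
        u = frontier.popleft()
        if u == t:
            return True, edges_scanned
        for v in adj.get(u, []):
            edges_scanned += 1
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return False, edges_scanned

# Toy DAG: 0 -> 1 -> 3, 0 -> 2
adj = {0: [1, 2], 1: [3]}
```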
5/8 Multiplication: LLMs notoriously flub the middle digits of big-integer products. Those digits hinge on all input bits, so you get a linear CoT lower bound. Our best upper bound uses Schönhage–Strassen in O(n log n) steps; closing the log n gap is open.
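For scale, a sketch (my own illustration, not the paper's construction) of the most explicit scratchpad a model could write for multiplication: grade-school style, one CoT step per digit pair. That is ~n² steps for n-digit inputs, sitting well above both the linear lower bound and the O(n log n) upper bound in the post.

```python
def long_multiplication_cot(a: int, b: int):
    """Grade-school multiplication as an explicit chain of thought.

    Emits one scratchpad step per digit pair, so two n-digit inputs
    yield n*n steps -- above both the linear lower bound and the
    O(n log n) Schoenhage-Strassen upper bound discussed in the thread.
    """
    xs = [int(d) for d in str(a)][::-1]  # least-significant digit first
    ys = [int(d) for d in str(b)][::-1]
    steps, total = [], 0
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            total += x * y * 10 ** (i + j)
            steps.append(f"digit {i} of a * digit {j} of b -> {x * y}")
    return total, steps

product, cot = long_multiplication_cot(1234, 5678)
# two 4-digit inputs -> 16 scratchpad steps
```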
4/8 Regular languages: Hard-attention Transformers without CoT sit in AC⁰. To recognize anything beyond AC⁰—e.g., Parity—you need at least linear-length CoTs. Sublinear won’t cut it.
3/8 Yes! We derive tight lower bounds for several algorithmic tasks. The theory assumes hard attention, but soft-attention experiments paint the same picture.
2/8 Prior work shows that CoT-equipped Transformers can simulate P-time computations via polynomial-length traces. But those broad results don’t pin down exactly how long you need to solve a specific problem. We ask: can we get fine-grained lower bounds on CoT length?