awni.bsky.social
@awni.bsky.social
phd student @ yale statistics & data science

studying the foundations of machine intelligence

awni.xyz
Check out the full paper with Omar Montasser & John Lafferty!

* Paper: arxiv.org/pdf/2505.15927
* Blog: awni.xyz/cot-info/

And come by our poster at NeurIPS in San Diego: neurips.cc/virtual/2025...

#NeurIPS2025 #MachineLearningTheory #LLM #ChainOfThought
[10/10]
November 25, 2025 at 4:27 AM
🔭 Implications for LLM research

* When designing annotation pipelines, investing in rich reasoning traces boosts data efficiency.
* A trace's value depends on how much internal computation it reveals.
* Enables measuring “trace quality” through an information-theoretic lens.

[9/n]
🧪 Theory meets Practice

We empirically validate our theory’s predictions in simple settings where the CoT information can be computed exactly.

We find that the theory closely predicts the sample-efficiency gains.
[8/n]
🔒 CoT Information is fundamental

Our theory provides both upper and lower bounds, showing that CoT information is a fundamental measure of the power of CoT supervision.
[7/n]
🔍 Interpretation

The CoT information CoT-Info(ε) captures the statistical advantage of CoT data.

For many reasoning tasks, CoT-Info(ε) ≫ ε, yielding much faster learning.

The ratio CoT-Info(ε) / ε can be interpreted as the relative value of one CoT sample compared to one end-to-end sample.
[6/n]
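The relative-value ratio above can be sketched numerically. The values below are hypothetical placeholders for illustration, not numbers from the paper:

```python
def relative_value(cot_info: float, eps: float) -> float:
    """Ratio CoT-Info(eps) / eps: roughly how many end-to-end samples
    one CoT-supervised sample is worth at target error eps."""
    return cot_info / eps

# Assumed toy values: target error 1%, CoT information 0.5 at that level.
eps = 0.01
cot_info = 0.5
print(relative_value(cot_info, eps))  # 50.0 -> one CoT sample ~ 50 end-to-end samples
```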
🧮 The Theory

To rule out hypotheses with error ε, classical learning theory says we need roughly O(1/ε) samples.

We prove that under CoT supervision, the sample complexity improves to O(1/CoT-Info(ε)).
[5/n]
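A back-of-the-envelope comparison of the two bounds, with constants suppressed and illustrative values assumed (not taken from the paper):

```python
import math

def n_end_to_end(eps: float, c: float = 1.0) -> int:
    """Samples needed ~ c / eps under end-to-end (label-only) supervision."""
    return math.ceil(c / eps)

def n_cot(cot_info: float, c: float = 1.0) -> int:
    """Samples needed ~ c / CoT-Info(eps) under CoT supervision."""
    return math.ceil(c / cot_info)

# Assumed toy values: target error 1%, and a trace that reveals far more
# information per sample than the label alone (CoT-Info(eps) = 0.4).
eps = 0.01
cot_info = 0.4
print(n_end_to_end(eps), n_cot(cot_info))  # 100 3
```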
🧠 The Insight: CoT supervision doesn’t just tell the model what to predict; it constrains how it thinks.

We formalize this by introducing the “CoT Information”: a measure of the extra discriminative power gained by observing the reasoning trace, not just the label.
[4/n]
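One way to see the extra discriminative power, in a toy construction of ours (not the paper's formal definition): with two-step hypotheses h(x) = g(f(x)) and the intermediate value f(x) as the trace, two hypotheses can agree on every label yet disagree on every trace:

```python
def disagreement_rates(f1, g1, f2, g2, xs):
    """Fraction of inputs where the label differs vs. where the (trace, label)
    pair differs. The trace can only reveal more, never less."""
    out = sum(g1(f1(x)) != g2(f2(x)) for x in xs) / len(xs)
    trace = sum((f1(x), g1(f1(x))) != (f2(x), g2(f2(x))) for x in xs) / len(xs)
    return out, trace

xs = list(range(100))
# Two hypotheses whose intermediate steps always differ but whose
# final outputs never do (g collapses everything to 0):
f1, g1 = (lambda x: x % 2), (lambda z: 0)
f2, g2 = (lambda x: (x + 1) % 2), (lambda z: 0)
out, trace = disagreement_rates(f1, g1, f2, g2, xs)
print(out, trace)  # 0.0 1.0 -> labels never distinguish them; traces always do
```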
💡 Core Problem: Training a model on only Input→Output (end-to-end) is like teaching a student math by showing them only the final answers.

To learn complex reasoning this way, you need a massive amount of data to rule out all the “wrong ways” to get the “right answer.”
[3/n]
Large language models have been transformed by the shift from “learn to predict the final answer” to “learn to predict the reasoning process” via chain-of-thought supervision.

Can we understand why this works through a statistical lens and quantify the advantage?
[2/n]