awni.bsky.social
@awni.bsky.social
phd student @ yale statistics & data science

studying the foundations of machine intelligence

awni.xyz
Check out the full paper with Omar Montasser & John Lafferty!

* Paper: arxiv.org/pdf/2505.15927
* Blog: awni.xyz/cot-info/

And come by our poster at NeurIPS in San Diego: neurips.cc/virtual/2025...

#NeurIPS2025 #MachineLearningTheory #LLM #ChainOfThought
[10/10]
November 25, 2025 at 4:27 AM
🔭 Implications for LLM research

* When designing annotation pipelines, investing in rich reasoning traces boosts data efficiency.
* A trace's value depends on how much internal computation it reveals.
* Enables measuring “trace quality” through an information-theoretic lens.

[9/n]
🧪 Theory meets Practice

We empirically validate our theory’s predictions in simple settings where the CoT information can be computed exactly.

We find that the theory closely predicts the sample-efficiency gains.
[8/n]
🔒 CoT Information is fundamental

Our theory provides both upper and lower bounds, showing that CoT information is a fundamental measure of the power of CoT supervision.
[7/n]
🔍 Interpretation

The CoT information CoT-Info(ε) captures the statistical advantage of CoT data.

For many reasoning tasks, CoT-Info(ε) ≫ ε, yielding much faster learning.

The ratio CoT-Info(ε) / ε can be interpreted as the relative value of one CoT sample compared to one end-to-end sample.
[6/n]
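The relative-value ratio above can be sketched numerically. The values below are hypothetical placeholders for illustration, not numbers from the paper:

```python
def relative_value(cot_info: float, eps: float) -> float:
    """Ratio CoT-Info(eps) / eps: roughly how many end-to-end samples
    one CoT-supervised sample is worth at target error eps."""
    return cot_info / eps

# Assumed toy values: target error 1%, CoT information 0.5 at that level.
eps = 0.01
cot_info = 0.5
print(relative_value(cot_info, eps))  # 50.0 -> one CoT sample ~ 50 end-to-end samples
```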
🧮 The Theory

To rule out hypotheses with error ε, classical learning theory says we need roughly O(1/ε) samples.

We prove that under CoT supervision, the sample complexity improves to O(1/CoT-Info(ε)).
[5/n]
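A back-of-the-envelope comparison of the two bounds, with constants suppressed and illustrative values assumed (not taken from the paper):

```python
import math

def n_end_to_end(eps: float, c: float = 1.0) -> int:
    """Samples needed ~ c / eps under end-to-end (label-only) supervision."""
    return math.ceil(c / eps)

def n_cot(cot_info: float, c: float = 1.0) -> int:
    """Samples needed ~ c / CoT-Info(eps) under CoT supervision."""
    return math.ceil(c / cot_info)

# Assumed toy values: target error 1%, and a trace that reveals far more
# information per sample than the label alone (CoT-Info(eps) = 0.4).
eps = 0.01
cot_info = 0.4
print(n_end_to_end(eps), n_cot(cot_info))  # 100 3
```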
🧠 The Insight: CoT supervision doesn’t just tell the model what to predict; it constrains how it thinks.

We formalize this by introducing the “CoT Information”: a measure of the extra discriminative power gained by observing the reasoning trace, not just the label.
[4/n]
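One way to see the extra discriminative power, in a toy construction of ours (not the paper's formal definition): with two-step hypotheses h(x) = g(f(x)) and the intermediate value f(x) as the trace, two hypotheses can agree on every label yet disagree on every trace:

```python
def disagreement_rates(f1, g1, f2, g2, xs):
    """Fraction of inputs where the label differs vs. where the (trace, label)
    pair differs. The trace can only reveal more, never less."""
    out = sum(g1(f1(x)) != g2(f2(x)) for x in xs) / len(xs)
    trace = sum((f1(x), g1(f1(x))) != (f2(x), g2(f2(x))) for x in xs) / len(xs)
    return out, trace

xs = list(range(100))
# Two hypotheses whose intermediate steps always differ but whose
# final outputs never do (g collapses everything to 0):
f1, g1 = (lambda x: x % 2), (lambda z: 0)
f2, g2 = (lambda x: (x + 1) % 2), (lambda z: 0)
out, trace = disagreement_rates(f1, g1, f2, g2, xs)
print(out, trace)  # 0.0 1.0 -> labels never distinguish them; traces always do
```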
💡 Core Problem: Training a model on only Input→Output (end-to-end) is like teaching a student math by showing them only the final answers.

To learn complex reasoning this way, you need a massive amount of data to rule out all the “wrong ways” to get the “right answer.”
[3/n]
Large language models have been transformed by the shift from “learn to predict the final answer” to “learn to predict the reasoning process” via chain-of-thought supervision.

Can we understand why this works through a statistical lens and quantify the advantage?
[2/n]