* Paper: arxiv.org/pdf/2505.15927
* Blog: awni.xyz/cot-info/
And come by our poster at NeurIPS in San Diego: neurips.cc/virtual/2025...
#NeurIPS2025 #MachineLearningTheory #LLM #ChainOfThought
[10/10]
* When designing annotation pipelines, investing in rich reasoning traces boosts data efficiency.
* The value of a trace depends on how much internal computation it reveals.
* Our framework makes “trace quality” measurable through an information-theoretic lens.
[9/n]
We empirically validate our theory’s predictions in simple settings where the CoT information can be computed exactly.
We find that the theory closely predicts the sample-efficiency gains.
[8/n]
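As a toy illustration of why traces shrink the hypothesis space faster (our own sketch, not the paper's experimental setup — the two-step function class and domain size are invented for illustration):

```python
import itertools, random

random.seed(0)

# Toy two-step hypothesis class (illustrative only): the label is
# y = f(g(x)); CoT supervision additionally reveals the trace z = g(x).
G = list(itertools.product(range(4), repeat=4))  # 256 candidate inner maps g
F = list(itertools.product(range(2), repeat=4))  # 16 candidate outer maps f
g_star, f_star = G[123], F[5]                    # arbitrary ground truth

xs = [random.randrange(4) for _ in range(3)]
data = [(x, g_star[x], f_star[g_star[x]]) for x in xs]  # (x, trace, label)

# Count hypotheses (g, f) still consistent with the data under each regime.
e2e = sum(all(f[g[x]] == y for x, _, y in data)
          for g in G for f in F)
cot = sum(all(g[x] == z and f[z] == y for x, z, y in data)
          for g in G for f in F)
print(f"consistent with labels only: {e2e}; with labels + traces: {cot}")
```

Any (g, f) consistent with the traces is also consistent with the labels alone, so the CoT version space is a subset of the end-to-end one, and typically much smaller; loosely, that gap is the kind of extra discriminative power CoT-Info(ε) formalizes.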
Our theory provides both upper and lower bounds, showing that CoT information is a fundamental measure of the power of CoT supervision.
[7/n]
The CoT information CoT-Info(ε) captures the statistical advantage of CoT data.
For many reasoning tasks, CoT-Info(ε) ≫ ε, yielding much faster learning.
The ratio CoT-Info(ε) / ε can be read as the value of one CoT sample relative to one end-to-end sample.
[6/n]
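Plugging in hypothetical numbers (not from the paper) to see what the ratio means:

```python
# Hypothetical numbers for illustration only.
eps = 0.01       # target error level
cot_info = 0.5   # assumed CoT information at that level, CoT-Info(eps)

n_e2e = 1 / eps       # end-to-end samples needed, up to constants: O(1/eps)
n_cot = 1 / cot_info  # CoT samples needed, up to constants: O(1/CoT-Info(eps))

# Each CoT-supervised sample is worth this many end-to-end samples:
relative_value = cot_info / eps
print(f"{n_e2e:.0f} end-to-end samples vs {n_cot:.0f} CoT samples "
      f"(one CoT sample ~ {relative_value:.0f} end-to-end samples)")
```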
To distinguish between hypotheses at error level ε, classical learning theory says we need on the order of 1/ε samples.
We prove that under CoT supervision, the sample complexity improves to O(1/CoT-Info(ε)).
[5/n]
We formalize this by introducing the “CoT Information”: a measure of the extra discriminative power gained by observing the reasoning trace, not just the label.
[4/n]
To learn complex reasoning this way, you need a massive amount of data to rule out all the “wrong ways” to get the “right answer.”
[3/n]
Can we understand why this works from a statistical lens and quantify the advantage?
[2/n]