RL Theory Lecture Notes: https://arxiv.org/abs/2312.16730
1. Vanilla SGD can have poor coverage; gradient normalization (à la Adam) fixes this
2. Test-time training (SGD on tokens during generation) provably improves coverage
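A toy illustration of why normalization helps (my sketch, not from the paper): vanilla SGD steps shrink with gradient magnitude, so tail directions with tiny gradients barely move, while an Adam-style per-coordinate normalized update moves every coordinate at roughly the same rate.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    # vanilla SGD: step size scales with gradient magnitude,
    # so tail coordinates with tiny gradients barely move
    return w - lr * grad

def normalized_step(w, grad, lr=0.1, eps=1e-8):
    # Adam-style per-coordinate normalization: every coordinate
    # moves at roughly rate lr regardless of gradient scale
    return w - lr * grad / (np.abs(grad) + eps)

w = np.zeros(3)
grad = np.array([1.0, 1e-4, -1e-4])  # one common direction, two tail ones
print(sgd_step(w, grad))         # tail coordinates move by ~1e-5
print(normalized_step(w, grad))  # all coordinates move by ~0.1
```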
Neat consequence: coverage generalizes faster as we go further into the tail (corresponding to more test-time compute).
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows the trend of coverage, not CE (increasing initially, then dropping as the shortcut is learned).
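A toy sketch of why BoN tracks coverage rather than CE (all names here are hypothetical): best-of-N succeeds iff at least one of the N samples verifies, so its success rate is governed by the probability mass p the model puts on correct answers, roughly 1 - (1 - p)^N, not by the average log-loss.

```python
import random

def best_of_n(sample, verify, n):
    """Draw n candidates from the model; return the first that verifies.
    Success depends only on the mass the model puts on correct answers
    (coverage), not on its cross-entropy."""
    for _ in range(n):
        y = sample()
        if verify(y):
            return y
    return None

# toy model: puts small mass p on the correct answer "42"
random.seed(0)
p = 0.05
sample = lambda: "42" if random.random() < p else "wrong"
verify = lambda y: y == "42"

trials = 2000
hits = sum(best_of_n(sample, verify, n=32) is not None for _ in range(trials))
print(hits / trials)  # ≈ 1 - (1 - p)**32 ≈ 0.81
```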
1) Next-token prediction (more generally, MLE) provably learns a model w/ good coverage, inheriting the training corpus’ coverage over tasks of interest.
2) Coverage generalizes *faster* than cross-entropy itself, and consequently can be more predictive.
Cov_N = P_{y∼π}[π(y|x)/π̂(y|x) ≥ N],
(π is the pre-training distribution, π̂ is the pre-trained model, N is a parameter)
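A sketch of how the coverage profile could be estimated empirically on samples y ∼ π, assuming (as in a synthetic setting) access to log-probabilities under both π and π̂ — the helper name is mine, not from the paper.

```python
import numpy as np

def coverage_profile(logp_true, logp_model, N):
    """Empirical coverage profile: fraction of samples y ~ pi whose
    likelihood ratio pi(y|x) / pi_hat(y|x) is at least N. A small value
    at large N means the model rarely down-weights corpus sequences
    by more than a factor of N."""
    log_ratio = np.asarray(logp_true) - np.asarray(logp_model)
    return float(np.mean(log_ratio >= np.log(N)))

# toy example with hypothetical log-probabilities for four samples
logp_true = np.array([-2.0, -3.0, -5.0, -1.0])
logp_model = np.array([-2.5, -6.0, -5.1, -1.2])
print(coverage_profile(logp_true, logp_model, N=8))  # 0.25: one ratio >= 8
```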
Paper: https://arxiv.org/abs/2510.15020
Thread below.
New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks.
The answer involves a metric we call the "coverage profile", not cross-entropy.
- As long as you have exact/verifiable outcome rewards, it always converges to the optimal distribution.
- Runtime depends on process-verifier quality, degrading gracefully as the verifier gets worse.
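A minimal sketch of the general backtracking idea (my illustration of the principle, not the paper's exact algorithm): extend the sequence one step at a time, and when the process verifier scores a prefix poorly, undo the last step and resample instead of committing to a doomed prefix.

```python
import random

def backtracking_generate(step, process_score, max_len=10,
                          threshold=0.5, max_tries=100):
    """Verifier-guided backtracking sketch: every accepted element of
    the returned sequence has passed the process verifier; rejected
    steps are popped and resampled."""
    seq, tries = [], 0
    while len(seq) < max_len and tries < max_tries:
        tries += 1
        seq.append(step(seq))
        if process_score(seq) < threshold:
            seq.pop()  # backtrack: undo the step the verifier rejects
    return seq

# toy task: build a sequence of even digits; the "verifier" rejects odds
random.seed(1)
step = lambda seq: random.randint(0, 9)
process_score = lambda seq: 1.0 if seq[-1] % 2 == 0 else 0.0
print(backtracking_generate(step, process_score, max_len=5))
```

A weaker verifier (one that mis-scores some prefixes) would force more resampling before a full sequence survives, which is the runtime/quality trade-off described above.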
The analysis adapts a classical technique from TCS (used to prove the equivalence of approximate counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.
A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
Paper: https://arxiv.org/abs/2510.03149
📝 Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies.
📆 Deadline: Sept 3, 2025
Submission link: openreview.net/group?id=lea...
See you in Lyon!
📝 Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models!
🗓️ Deadline: May 19, 2025
ITP leverages pessimism about reward model uncertainty to robustly avoid reward hacking.
5/11
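The pessimism principle behind this can be sketched as follows (a generic lower-confidence-bound selector, not the paper's exact ITP algorithm; all names are mine): penalize each candidate's estimated reward by its uncertainty before selecting, so a response the reward model is unsure about cannot win just because its point estimate happens to be high.

```python
def pessimistic_select(candidates, reward_mean, reward_std, beta=1.0):
    """Score each candidate by a lower confidence bound on its reward.
    High-mean/high-uncertainty candidates (a classic reward-hacking
    signature) are penalized and not selected."""
    lcb = lambda y: reward_mean(y) - beta * reward_std(y)
    return max(candidates, key=lcb)

# toy example: "b" has the highest mean reward but huge uncertainty
means = {"a": 0.8, "b": 1.5, "c": 0.7}
stds = {"a": 0.1, "b": 1.2, "c": 0.1}
best = pessimistic_select(["a", "b", "c"], means.get, stds.get)
print(best)  # "a": 0.8 - 0.1 = 0.7 beats b's 1.5 - 1.2 = 0.3
```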
New paper (appearing at ICML) led by the amazing Audrey Huang (ahahaudrey.bsky.social) with Adam Block, Qinghua Liu, Nan Jiang, and Akshay Krishnamurthy (akshaykr.bsky.social).
1/11