Dylan Foster 🐢
@djfoster.bsky.social
Principal Researcher in AI/ML/RL Theory @ Microsoft Research NE/NYC. Previously @ MIT, Cornell. http://dylanfoster.net


RL Theory Lecture Notes: https://arxiv.org/abs/2312.16730
(12/12) Paper link: www.arxiv.org/abs/2510.15020

Many interesting questions: What are the minimal conditions for RL success (maybe a more semantic notion of coverage)? Do other algos/interventions secretly have coverage-based interpretations?
October 25, 2025 at 4:21 PM
(11/12) Another algo example: We give some tournament-style methods for selecting checkpoints (motivated by coverage) that can beat cross-entropy-based selection. See figure, where coverage-based selection finds the model w/ the best BoN performance.
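(The tournament structure itself is generic; here is a minimal single-elimination sketch. The pairwise rule `beats` is left abstract, since the paper's coverage-based comparison isn't spelled out in this thread.)

```python
def tournament_select(checkpoints, beats):
    """Single-elimination tournament over a list of checkpoints.

    `beats(a, b) -> bool` is an abstract pairwise comparison, left as a parameter;
    the paper's coverage-motivated criterion is not reproduced here.
    """
    pool = list(checkpoints)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if beats(a, b) else b)
        if len(pool) % 2 == 1:  # odd checkpoint out gets a bye to the next round
            winners.append(pool[-1])
        pool = winners
    return pool[0]

# Hypothetical usage: compare checkpoints by some estimated coverage-style score.
# best = tournament_select(ckpts, beats=lambda a, b: score(a) > score(b))
```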
October 25, 2025 at 4:21 PM
(10/12) Perhaps the coolest part: the coverage perspective seems to have non-trivial algo design consequences:

1. Vanilla SGD can have poor coverage; gradient normalization (a-la Adam) fixes this

2. Test-time training (SGD on tokens during generation) provably improves coverage
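(Not the paper's algorithm, just a toy NumPy sketch of what point 1 is gesturing at: a vanilla SGD step vs. a coordinate-wise normalized step, roughly Adam with momentum stripped out.)

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    # Vanilla SGD: the update scales with the raw gradient, so directions tied to
    # rare (tail) tokens, which produce tiny gradients, barely move.
    return w - lr * grad

def normalized_step(w, grad, lr=0.1, eps=1e-8):
    # Coordinate-wise normalization (roughly what Adam does once momentum is removed):
    # every coordinate moves at about the same rate regardless of gradient magnitude.
    return w - lr * grad / (np.abs(grad) + eps)
```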
October 25, 2025 at 4:21 PM
(9/12) We show this through two lenses:
1. New generalization analysis for next-token prediction/maximum likelihood for general function classes, in the vein of stat. learning.
2. SGD in the one-pass regime for overparameterized autoregressive linear models
October 25, 2025 at 4:21 PM
(8/12) Our analysis is based on a phenomenon we call the coverage principle, where next-token prediction implicitly optimizes toward a model with good coverage.

Neat consequence: Coverage generalizes faster as we go further into the tail (corresponding to more test-time compute)
October 25, 2025 at 4:21 PM
(7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows trend of coverage, not CE (increasing initially, dropping as shortcut is learned).
October 25, 2025 at 4:21 PM
(6/12) Our main result:

1) Next-token prediction (more generally, MLE) provably learns a model w/ good coverage, inheriting the training corpus’ coverage over tasks of interest.

2) Coverage generalizes *faster* than cross-entropy itself, and consequently can be more predictive.
October 25, 2025 at 4:21 PM
(5/12) Example: For tasks that are well-represented in the pre-training distribution, low Cov_N is necessary and sufficient for Best-of-N to succeed.

Empirically, some notion of coverage also seems necessary for current RL methods (eg arxiv.org/abs/2504.13837)
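(For readers unfamiliar with the setup, a bare-bones Best-of-N sketch; the `sample_fn`/`reward_fn` interface is hypothetical. BoN can only succeed if a good answer shows up among the N draws, which is exactly what coverage at level ~N controls.)

```python
def best_of_n(sample_fn, reward_fn, prompt, n):
    """Draw n candidates from the model and return the one the reward model /
    verifier scores highest. If the model puts less than ~1/n mass on good
    answers for this prompt (poor coverage at level n), no candidate is good
    and BoN fails no matter how accurate the reward is."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```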
October 25, 2025 at 4:20 PM
(4/12) Intuition: Cross-entropy (KL) tracks the *mean* of the log density ratio, while the coverage profile tracks its *tail* (the CDF, indexed by N).

N controls how far into the tail we go, and roughly corresponds to test-time compute/RL effort. For downstream performance, the tail is what matters.
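(A toy numerical illustration of the mean-vs-tail distinction, with made-up numbers: two models with the same average log density ratio, i.e., roughly the same cross-entropy gap, can have completely different coverage profiles.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: log density ratio log(pi / pi_hat) is always moderate.
log_ratio_a = np.full(n, 0.5)
# Model B: usually ~0, but with probability 1% the ratio is huge (e.g., a dropped mode).
log_ratio_b = np.where(rng.random(n) < 0.01, 50.0, 0.0)

for name, lr in [("A", log_ratio_a), ("B", log_ratio_b)]:
    print(f"model {name}: mean log-ratio = {lr.mean():.2f}, "
          f"Cov_N at N=1000 = {np.mean(lr >= np.log(1000)):.3f}")
# Same mean (same KL-style metric), but only model B has non-trivial mass deep in the tail.
```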
October 25, 2025 at 4:20 PM
(3/12) Our answer is a quantity we call the *coverage profile*, which captures the extent to which the pre-trained model "covers" the pre-training distribution:

Cov_N = P_{π}[π(y|x)/\hat{π}(y|x) ≥ N],

(π is pre-training dist, \hat{π} is pre-trained model, N is a param)
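(A minimal Monte Carlo sketch of the definition, purely for illustration: it assumes you can evaluate log π(y|x) under the pre-training distribution itself, which you typically can't in practice.)

```python
import numpy as np

def coverage_profile(logp_data, logp_model, thresholds):
    """Estimate Cov_N = P_pi[ pi(y|x) / pi_hat(y|x) >= N ] from samples (x, y) ~ pi.

    logp_data:   log pi(y|x)      for each sample
    logp_model:  log pi_hat(y|x)  for each sample
    thresholds:  values of N at which to read off the profile
    """
    log_ratio = np.asarray(logp_data) - np.asarray(logp_model)
    return {N: float(np.mean(log_ratio >= np.log(N))) for N in thresholds}
```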
October 25, 2025 at 4:20 PM
(2/12) We wanted to take a first step toward developing some theoretical understanding of the relationship btw next-token prediction and downstream perf (particularly w/ test-time scaling).

Concretely, what metrics of the pre-trained model can we link to downstream success?
October 25, 2025 at 4:20 PM
(1/12) Many have observed that starting from a better next-token predictor (low cross-entropy) does not always give better downstream post-training/test-time perf. In some cases, CE can even be anti-correlated with BoN success (see fig or arxiv.org/abs/2502.07154). Why is this?
October 25, 2025 at 4:20 PM
Led by my amazing intern Fan Chen, with awesome team Audrey Huang (@ahahaudrey.bsky.social), Noah Golowich, Sadhika Malladi (@sadhika.bsky.social), Adam Block, Jordan Ash, and Akshay Krishnamurthy.

Paper: www.arxiv.org/abs/2510.15020

Thread below.
October 25, 2025 at 4:20 PM
Sorry about that! There was an issue where the application briefly closed early. Today is the deadline and it should be open again now.
October 22, 2025 at 7:39 PM
Lots of interesting directions here! We think there is a lot more to do building on the connection to the discrete sampling/TCS literature and algos from this space, as well as moving beyond autoregressive generation.
October 12, 2025 at 4:26 PM
Empirically, we have only tried this w/ small-ish scale so far, but find consistently that VGB outperforms textbook algos on either (1) accuracy or (2) diversity when compute-normalized. Ex: for the Dyck language, VGB escapes the accuracy-diversity frontier for baseline algos.
October 12, 2025 at 4:26 PM
Main guarantee:
- As long as you have exact/verifiable outcome rewards, always converges to optimal distribution.
- Runtime depends on process verifier quality, gracefully degrading as quality gets worse.
October 12, 2025 at 4:26 PM
VGB generalizes the Sinclair–Jerrum '89 random walk (people.eecs.berkeley.edu/~sinclair/ap...)
from TCS (used to prove equivalence of apx. counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.
October 12, 2025 at 4:26 PM
We give a new algo, Value-Guided Backtracking (VGB), where the idea is to view autoregressive generation as a random walk on the tree of partial outputs, and add a *stochastic backtracking* step—occasionally erasing tokens in a principled way—to counter error amplification.
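(A heavily simplified caricature of the random-walk-with-backtracking idea, NOT the paper's actual transition rule; the real VGB picks its forward/backward probabilities from the process verifier in a specific, principled way.)

```python
import random

def generate_with_backtracking(next_token_probs, value, max_len, base_backtrack=0.2):
    """Walk on the tree of partial outputs: usually extend the prefix by sampling a
    token, but occasionally erase the last token, more aggressively when the
    verifier's value estimate for the current prefix is low."""
    prefix = []
    while len(prefix) < max_len:
        if prefix and random.random() < base_backtrack * (1.0 - value(prefix)):
            prefix.pop()          # stochastic backtracking step
            continue
        probs = next_token_probs(prefix)          # dict: token -> probability
        tokens, weights = zip(*probs.items())
        prefix.append(random.choices(tokens, weights=weights, k=1)[0])
    return prefix
```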
October 12, 2025 at 4:26 PM
Test-time guidance with learned process verifiers has untapped potential to enhance language model reasoning, but a key obstacle to making it work is that small verifier mistakes get amplified by textbook algos (e.g., block-wise BoN), w/ errors compounding as generation length increases.
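(For concreteness, a sketch of what a textbook block-wise BoN loop looks like, with a hypothetical `sample_block`/`verifier` interface. Because the best-scoring block is locked in at every step, a slightly miscalibrated verifier can commit to a bad prefix early, and that error compounds over the blocks.)

```python
def blockwise_best_of_n(sample_block, verifier, num_blocks, n):
    """Greedy block-wise Best-of-N guided by a process verifier.

    sample_block(prefix) -> list of tokens extending the prefix
    verifier(seq)        -> scalar score for a (partial) sequence
    """
    prefix = []
    for _ in range(num_blocks):
        candidates = [sample_block(prefix) for _ in range(n)]
        best = max(candidates, key=lambda block: verifier(prefix + block))
        prefix = prefix + best  # commit to the best-scoring block; mistakes here persist
    return prefix
```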
October 12, 2025 at 4:26 PM
With amazing team: Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, and Andrej Risteski (@andrejristeski.bsky.social).

Short thread below.
October 12, 2025 at 4:26 PM