Dylan Foster 🐢
@djfoster.bsky.social
Principal Researcher in AI/ML/RL Theory @ Microsoft Research NE/NYC. Previously @ MIT, Cornell. http://dylanfoster.net


RL Theory Lecture Notes: https://arxiv.org/abs/2312.16730
(12/12) Another algo example: We give some tournament-style methods for selecting checkpoints (motivated by coverage) that can beat cross-entropy-based selection. See figure, where coverage-based selection finds the model w/ the best BoN performance.
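For intuition only, here is a generic single-elimination tournament skeleton (my sketch, not the paper's method; `beats` is a hypothetical pairwise comparison standing in for whatever coverage-motivated statistic the paper actually uses):

```python
def tournament_select(checkpoints, beats):
    """Single-elimination tournament over checkpoints.

    `beats(a, b) -> bool` is a hypothetical pairwise comparison (e.g. some
    coverage-motivated statistic); this skeleton only shows the tournament
    structure, not the comparison used in the paper."""
    pool = list(checkpoints)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            nxt.append(a if beats(a, b) else b)
        if len(pool) % 2 == 1:
            nxt.append(pool[-1])  # odd checkpoint out gets a bye this round
        pool = nxt
    return pool[0]
```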
October 25, 2025 at 4:21 PM
(10/12) Perhaps the coolest part: the coverage perspective seems to have non-trivial algo design consequences:

1. Vanilla SGD can have poor coverage; gradient normalization (à la Adam) fixes this (toy sketch after this list)

2. Test-time training (SGD on tokens during generation) provably improves coverage
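
Toy sketch of point 1 (my illustration, not the paper's experiments): the only difference between the two updates below is per-coordinate gradient normalization.

```python
import torch

def sgd_step(params, lr=1e-2):
    """Vanilla SGD: the update is proportional to the raw gradient, so
    coordinates with tiny gradients barely move. Assumes .backward() was
    already called."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)

def normalized_step(params, lr=1e-2, eps=1e-8):
    """Per-coordinate normalized update (in the spirit of Adam / signSGD):
    every coordinate moves by roughly +-lr regardless of gradient magnitude."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad / (p.grad.abs() + eps), alpha=-lr)
```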
October 25, 2025 at 4:21 PM
(8/12) Our analysis is based on a phenomenon we call the coverage principle, where next-token prediction implicitly optimizes toward a model with good coverage.

Neat consequence: Coverage generalizes faster as we go further into the tail (corresponding to more test-time compute)
October 25, 2025 at 4:21 PM
(7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows the trend of coverage, not CE (increasing initially, dropping as the shortcut is learned).
October 25, 2025 at 4:21 PM
(6/12) Our main result:

1) Next-token prediction (more generally, MLE) provably learns a model w/ good coverage, inheriting the training corpus’ coverage over tasks of interest.

2) Coverage generalizes *faster* than cross-entropy itself, and consequently can be more predictive.
October 25, 2025 at 4:21 PM
(3/12) Our answer is a quantity we call the *coverage profile*, which captures the extent to which the pre-trained model "covers" the pre-training distribution:

Cov_N = P_{y ∼ π}[ π(y|x) / π̂(y|x) ≥ N ],

(π is the pre-training dist, π̂ is the pre-trained model, N is a param)
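
To make the definition concrete, here's a minimal Monte Carlo sketch (mine, not code from the paper; the function name and toy numbers are placeholders), assuming you can score each sample y ~ π under both π and π̂:

```python
import numpy as np

def coverage_profile(logp_data, logp_model, N):
    """Monte Carlo estimate of Cov_N = P_{y ~ pi}[pi(y|x) / pihat(y|x) >= N],
    given per-sample log-likelihoods under the data distribution pi and the
    pre-trained model pihat, for samples y drawn from pi."""
    log_ratio = np.asarray(logp_data) - np.asarray(logp_model)
    return float(np.mean(log_ratio >= np.log(N)))

# Toy usage with made-up log-likelihoods (real usage would score actual samples):
rng = np.random.default_rng(0)
logp_data = rng.normal(-50.0, 5.0, size=1000)             # log pi(y|x)
logp_model = logp_data - rng.gamma(2.0, 1.0, size=1000)   # log pihat(y|x), slightly under-covering
print(coverage_profile(logp_data, logp_model, N=10))
```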
October 25, 2025 at 4:20 PM
(1/12) Many have observed that starting from a better next‑token predictor (low cross-entropy) does not always give better downstream post-training/test-time perf. In some cases, CE can even be anti‑correlated with BoN success (see fig or arxiv.org/abs/2502.07154). Why is this?
October 25, 2025 at 4:20 PM
Led by my amazing intern Fan Chen, with awesome team Audrey Huang (@ahahaudrey.bsky.social), Noah Golowich, Sadhika Malladi (@sadhika.bsky.social), Adam Block, Jordan Ash, and Akshay Krishnamurthy.

Paper: www.arxiv.org/abs/2510.15020

Thread below.
October 25, 2025 at 4:20 PM
The coverage principle: How pre-training enables post-training

New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks.

The answer involves a metric we call the "coverage profile", not cross-entropy.
October 25, 2025 at 4:20 PM
Empirically, we have only tried this at small-ish scale so far, but consistently find that VGB outperforms textbook algos on either (1) accuracy or (2) diversity when compute-normalized. Ex: for the Dyck language, VGB escapes the accuracy-diversity frontier of the baseline algos.
October 12, 2025 at 4:26 PM
Main guarantee:
- As long as you have exact/verifiable outcome rewards, VGB always converges to the optimal distribution.
- Runtime depends on process-verifier quality, gracefully degrading as quality gets worse.
October 12, 2025 at 4:26 PM
VGB generalizes the Sinclair–Jerrum '89 random walk (people.eecs.berkeley.edu/~sinclair/ap...)
from TCS (used to prove equivalence of apx. counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.
October 12, 2025 at 4:26 PM
We give a new algo, Value-Guided Backtracking (VGB), where the idea is to view autoregressive generation as a random walk on the tree of partial outputs, and add a *stochastic backtracking* step—occasionally erasing tokens in a principled way—to counter error amplification.
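
A rough caricature of the random-walk-with-backtracking picture (illustration only, not the paper's VGB algorithm; `step_model` and `verifier` are stand-in callables):

```python
import random

def backtracking_generate(step_model, verifier, max_len, eos, p_back=0.2, max_steps=10_000):
    """Caricature of generation as a random walk on the tree of partial
    outputs with stochastic backtracking (NOT the paper's VGB algorithm).

    step_model(prefix) -> next token, sampled autoregressively
    verifier(prefix)   -> process-verifier score in [0, 1] for the partial output
    """
    prefix = []
    for _ in range(max_steps):
        if len(prefix) >= max_len or (prefix and prefix[-1] == eos):
            break
        # Step back up the tree with probability that grows as the verifier
        # becomes less happy with the current partial output.
        if prefix and random.random() < p_back * (1.0 - verifier(prefix)):
            prefix.pop()
        else:
            prefix.append(step_model(prefix))  # step down: extend by one token
    return prefix
```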
October 12, 2025 at 4:26 PM
Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking.

A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.

Paper: www.arxiv.org/abs/2510.03149
October 12, 2025 at 4:26 PM
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025!

📝Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies.

📆 Deadline: Sept 3, 2025
August 11, 2025 at 3:40 PM
For those at ICML, Audrey will be presenting this paper at the 4:30pm poster session this afternoon! West Exhibition Hall B2-B3, poster W-1009
July 15, 2025 at 3:59 PM
Dhruv Rohatgi will be giving a lecture on our recent work on comp-stat tradeoffs in next-token prediction at the RL Theory virtual seminar series (rl-theory.bsky.social) tomorrow at 2pm EST! Should be a fun talk---come check it out!!
May 26, 2025 at 7:19 PM
Website (& CFP/instructions): fopt-workshop.github.io

Submission link: openreview.net/group?id=lea...

See you in Lyon!
May 9, 2025 at 5:10 PM
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025!

📝 Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models!

🗓️ Deadline: May 19, 2025
May 9, 2025 at 5:10 PM
Empirically, we find that InferenceTimePessimism is robust to reward hacking across tasks, base models, and reward models.

8/11
May 3, 2025 at 5:40 PM
Result #2: To address these shortcomings, we propose a new method: InferenceTimePessimism.

ITP leverages pessimism about reward model uncertainty to robustly avoid reward hacking.
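
As a generic illustration of the pessimism principle (not the paper's ITP algorithm; `reward_mean` and `reward_uncertainty` are hypothetical callables), one could rank candidates by a lower confidence bound on reward:

```python
def pessimistic_select(prompt, candidates, reward_mean, reward_uncertainty, beta=1.0):
    """Generic pessimism-style selection (an illustration of the principle,
    NOT the paper's InferenceTimePessimism algorithm): score each candidate
    by a lower confidence bound, estimated reward minus beta * uncertainty,
    so a response is not chosen just because the reward model is confidently
    wrong about it."""
    return max(candidates,
               key=lambda y: reward_mean(prompt, y) - beta * reward_uncertainty(prompt, y))
```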

5/11
May 3, 2025 at 5:40 PM
Inference-time computation can boost language model performance, but scaling naively—like with Best-of-N sampling (selecting the top-scoring response from multiple candidates with a reward model)—can actually degrade results through reward hacking (e.g., this fig from arxiv.org/abs/2210.10760).
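
For reference, Best-of-N in its plainest form (a generic sketch; `generate` and `reward_model` are stand-in callables):

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Plain Best-of-N: sample n candidate responses and return the one the
    reward model scores highest. As n grows, the argmax increasingly exploits
    the reward model's errors (the reward-hacking failure mode above)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model(prompt, y))
```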

2/11
May 3, 2025 at 5:40 PM
Is Best-of-N really the best we can do for language model inference?

New paper (appearing at ICML) led by the amazing Audrey Huang (ahahaudrey.bsky.social) with Adam Block, Qinghua Liu, Nan Jiang, and Akshay Krishnamurthy (akshaykr.bsky.social).

1/11
May 3, 2025 at 5:40 PM
Main result #4: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with action-level coverage) through multi-turn exploration, highlighting a provable (computational vs. statistical) benefit of multi-turn interaction.

16/
March 27, 2025 at 5:28 PM