Dylan Foster 🐢
@djfoster.bsky.social
Principal Researcher in AI/ML/RL Theory @ Microsoft Research NE/NYC. Previously @ MIT, Cornell. http://dylanfoster.net


RL Theory Lecture Notes: https://arxiv.org/abs/2312.16730
(12/12) Paper link: www.arxiv.org/abs/2510.15020

Many interesting questions: What are the minimal conditions for RL success (maybe a more semantic notion of coverage)? Do other algos/interventions secretly have coverage-based interpretations?
October 25, 2025 at 4:21 PM
(11/12) Another algo example: We give some tournament-style methods for selecting checkpoints (motivated by coverage) that can beat cross-entropy-based selection. See figure, where coverage-based selection finds the model w/ the best BoN performance.
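(The tournament structure itself is generic; here is a minimal single-elimination sketch. The pairwise rule `beats` is left abstract, since the paper's coverage-based comparison isn't spelled out in this thread.)

```python
def tournament_select(checkpoints, beats):
    """Single-elimination tournament over a list of checkpoints.

    `beats(a, b) -> bool` is an abstract pairwise comparison, left as a parameter;
    the paper's coverage-motivated criterion is not reproduced here.
    """
    pool = list(checkpoints)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if beats(a, b) else b)
        if len(pool) % 2 == 1:  # odd checkpoint out gets a bye to the next round
            winners.append(pool[-1])
        pool = winners
    return pool[0]

# Hypothetical usage: compare checkpoints by some estimated coverage-style score.
# best = tournament_select(ckpts, beats=lambda a, b: score(a) > score(b))
```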
October 25, 2025 at 4:21 PM
(10/12) Perhaps the coolest part: the coverage perspective seems to have non-trivial algo design consequences:

1. Vanilla SGD can have poor coverage; gradient normalization (a-la Adam) fixes this

2. Test-time training (SGD on tokens during generation) provably improves coverage
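(Not the paper's algorithm, just a toy NumPy sketch of what point 1 is gesturing at: a vanilla SGD step vs. a coordinate-wise normalized step, roughly Adam with momentum stripped out.)

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    # Vanilla SGD: the update scales with the raw gradient, so directions tied to
    # rare (tail) tokens, which produce tiny gradients, barely move.
    return w - lr * grad

def normalized_step(w, grad, lr=0.1, eps=1e-8):
    # Coordinate-wise normalization (roughly what Adam does once momentum is removed):
    # every coordinate moves at about the same rate regardless of gradient magnitude.
    return w - lr * grad / (np.abs(grad) + eps)
```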
October 25, 2025 at 4:21 PM
(9/12) We show this through two lenses:
1. New generalization analysis for next-token prediction/maximum likelihood for general function classes, in the vein of stat. learning.
2. SGD in the one-pass regime for overparameterized autoregressive linear models
October 25, 2025 at 4:21 PM
(8/12) Our analysis is based on a phenomenon we call the coverage principle, where next-token prediction implicitly optimizes toward a model with good coverage.

Neat consequence: Coverage generalizes faster as we go further into the tail (corresponding to more test-time compute)
October 25, 2025 at 4:21 PM
(7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows trend of coverage, not CE (increasing initially, dropping as shortcut is learned).
October 25, 2025 at 4:21 PM
(6/12) Our main result:

1) Next-token prediction (more generally, MLE) provably learns a model w/ good coverage, inheriting the training corpus’ coverage over tasks of interest.

2) Coverage generalizes *faster* than cross-entropy itself, and consequently can be more predictive.
October 25, 2025 at 4:21 PM
(5/12) Example: For tasks that are well-represented in the pre-training distribution, low Cov_N is necessary and sufficient for Best-of-N to succeed.

Empirically, some notion of coverage also seems necessary for current RL methods (eg arxiv.org/abs/2504.13837)
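(For readers unfamiliar with the setup, a bare-bones Best-of-N sketch; the `sample_fn`/`reward_fn` interface is hypothetical. BoN can only succeed if a good answer shows up among the N draws, which is exactly what coverage at level ~N controls.)

```python
def best_of_n(sample_fn, reward_fn, prompt, n):
    """Draw n candidates from the model and return the one the reward model /
    verifier scores highest. If the model puts less than ~1/n mass on good
    answers for this prompt (poor coverage at level n), no candidate is good
    and BoN fails no matter how accurate the reward is."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```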
October 25, 2025 at 4:20 PM
(4/12) Intuition: Cross-entropy (KL) tracks the *mean* of the log density ratio, while the coverage profile tracks its *tail* (the CDF, indexed by N).

N controls how far into the tail we go, and roughly corresponds to test-time compute/RL effort. For downstream performance, the tail is what matters.
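(A toy numerical illustration of the mean-vs-tail distinction, with made-up numbers: two models with the same average log density ratio, i.e., roughly the same cross-entropy gap, can have completely different coverage profiles.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: log density ratio log(pi / pi_hat) is always moderate.
log_ratio_a = np.full(n, 0.5)
# Model B: usually ~0, but with probability 1% the ratio is huge (e.g., a dropped mode).
log_ratio_b = np.where(rng.random(n) < 0.01, 50.0, 0.0)

for name, lr in [("A", log_ratio_a), ("B", log_ratio_b)]:
    print(f"model {name}: mean log-ratio = {lr.mean():.2f}, "
          f"Cov_N at N=1000 = {np.mean(lr >= np.log(1000)):.3f}")
# Same mean (same KL-style metric), but only model B has non-trivial mass deep in the tail.
```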
October 25, 2025 at 4:20 PM
(3/12) Our answer is a quantity we call the *coverage profile*, which captures the extent to which the pre-trained model "covers" the pre-training distribution:

Cov_N = P_{π}[π(y|x)/\hat{π}(y|x) ≥ N],

(π is pre-training dist, \hat{π} is pre-trained model, N is a param)
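(A minimal Monte Carlo sketch of the definition, purely for illustration: it assumes you can evaluate log π(y|x) under the pre-training distribution itself, which you typically can't in practice.)

```python
import numpy as np

def coverage_profile(logp_data, logp_model, thresholds):
    """Estimate Cov_N = P_pi[ pi(y|x) / pi_hat(y|x) >= N ] from samples (x, y) ~ pi.

    logp_data:   log pi(y|x)      for each sample
    logp_model:  log pi_hat(y|x)  for each sample
    thresholds:  values of N at which to read off the profile
    """
    log_ratio = np.asarray(logp_data) - np.asarray(logp_model)
    return {N: float(np.mean(log_ratio >= np.log(N))) for N in thresholds}
```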
October 25, 2025 at 4:20 PM
(2/12) We wanted to take a first step toward developing some theoretical understanding of the relationship btw next-token prediction and downstream perf (particularly w/ test-time scaling).

Concretely, what metrics of the pre-trained model can we link to downstream success?
October 25, 2025 at 4:20 PM
(1/12) Many have observed that starting from a better next-token predictor (low cross-entropy) does not always give better downstream post-training/test-time perf. In some cases, CE can even be anti-correlated with BoN success (see fig or arxiv.org/abs/2502.07154). Why is this?
October 25, 2025 at 4:20 PM
Led by my amazing intern Fan Chen, with awesome team Audrey Huang (@ahahaudrey.bsky.social), Noah Golowich, Sadhika Malladi (@sadhika.bsky.social), Adam Block, Jordan Ash, and Akshay Krishnamurthy.

Paper: www.arxiv.org/abs/2510.15020

Thread below.
October 25, 2025 at 4:20 PM
Sorry about that! There was an issue where the application briefly closed early. Today is the deadline and it should be open again now.
October 22, 2025 at 7:39 PM
Lots of interesting directions here! We think there is a lot more to do building on the connection to the discrete sampling/TCS literature and algos from this space, as well as moving beyond autoregressive generation.
October 12, 2025 at 4:26 PM
Empirically, we have only tried this w/ small-ish scale so far, but find consistently that VGB outperforms textbook algos on either (1) accuracy or (2) diversity when compute-normalized. Ex: for the Dyck language, VGB escapes the accuracy-diversity frontier for baseline algos.
October 12, 2025 at 4:26 PM
Main guarantee:
- As long as you have exact/verifiable outcome rewards, always converges to optimal distribution.
- Runtime depends on process verifier quality, gracefully degrading as quality gets worse.
October 12, 2025 at 4:26 PM
VGB generalizes the Sinclair–Jerrum '89 random walk (people.eecs.berkeley.edu/~sinclair/ap...)
from TCS (used to prove equivalence of apx. counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.
October 12, 2025 at 4:26 PM
We give a new algo, Value-Guided Backtracking (VGB), where the idea is to view autoregressive generation as a random walk on the tree of partial outputs, and add a *stochastic backtracking* step—occasionally erasing tokens in a principled way—to counter error amplification.
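(A heavily simplified caricature of the random-walk-with-backtracking idea, NOT the paper's actual transition rule; the real VGB picks its forward/backward probabilities from the process verifier in a specific, principled way.)

```python
import random

def generate_with_backtracking(next_token_probs, value, max_len, base_backtrack=0.2):
    """Walk on the tree of partial outputs: usually extend the prefix by sampling a
    token, but occasionally erase the last token, more aggressively when the
    verifier's value estimate for the current prefix is low."""
    prefix = []
    while len(prefix) < max_len:
        if prefix and random.random() < base_backtrack * (1.0 - value(prefix)):
            prefix.pop()          # stochastic backtracking step
            continue
        probs = next_token_probs(prefix)          # dict: token -> probability
        tokens, weights = zip(*probs.items())
        prefix.append(random.choices(tokens, weights=weights, k=1)[0])
    return prefix
```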
October 12, 2025 at 4:26 PM
Test-time guidance with learned process verifiers has untapped potential to enhance language model reasoning, but a key obstacle to making it work is that small verifier mistakes get amplified by textbook algos (e.g., block-wise BoN), w/ errors compounding as generation length increases.
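(For concreteness, a sketch of what a textbook block-wise BoN loop looks like, with a hypothetical `sample_block`/`verifier` interface. Because the best-scoring block is locked in at every step, a slightly miscalibrated verifier can commit to a bad prefix early, and that error compounds over the blocks.)

```python
def blockwise_best_of_n(sample_block, verifier, num_blocks, n):
    """Greedy block-wise Best-of-N guided by a process verifier.

    sample_block(prefix) -> list of tokens extending the prefix
    verifier(seq)        -> scalar score for a (partial) sequence
    """
    prefix = []
    for _ in range(num_blocks):
        candidates = [sample_block(prefix) for _ in range(n)]
        best = max(candidates, key=lambda block: verifier(prefix + block))
        prefix = prefix + best  # commit to the best-scoring block; mistakes here persist
    return prefix
```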
October 12, 2025 at 4:26 PM
With amazing team: Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, and Andrej Risteski (@andrejristeski.bsky.social).

Short thread below.
October 12, 2025 at 4:26 PM