Charlie Snell
@seasnell.bsky.social
PhD @berkeley_ai; prev SR @GoogleDeepMind. I stare at my computer a lot and make things
All model checkpoints we used for this research are also available here: https://huggingface.co/openlm-research
November 26, 2024 at 10:37 PM
This was a fun project with Eric Wallace, Dan Klein, and Sergey Levine.
An early version of this work also appeared in COLM 2024.

Paper link: arxiv.org/abs/2411.16035
Predicting Emergent Capabilities by Finetuning
November 26, 2024 at 10:37 PM
Finally, we present a case study of two real-world uses for emergence prediction:

1) cheaply assessing pretraining data quality (left).

2) predicting more complex capabilities, closer to those of future frontier models, using the difficult APPS coding benchmark (right).
November 26, 2024 at 10:37 PM
We validate our emergence law using four standard NLP benchmarks where large-scale open-source LLMs already demonstrate emergence, so we can easily check our predictions.

We find that our emergence law can accurately predict the point of emergence up to 4x the FLOPs in advance.
November 26, 2024 at 10:37 PM
To operationalize this insight, we finetune LLMs on varying amounts of data and fit a parametric function (i.e., “emergence law”) which models how the point of emergence shifts with the amount of data. We can then extrapolate a prediction for emergence in the few-shot setting.
November 26, 2024 at 10:37 PM
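To make that recipe concrete, here is a minimal sketch in Python, assuming (purely for illustration) that the point of emergence shifts logarithmically with the amount of finetuning data; the functional form, the scipy fit, and every number below are hypothetical stand-ins, not the paper's fitted law:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: the point of emergence (in pretraining-loss
# space) observed after finetuning on different amounts of task data.
n_finetune = np.array([500, 2000, 8000, 32000])
emergence_loss = np.array([2.45, 2.60, 2.74, 2.91])

def emergence_law(n, fewshot_loss, k):
    """Assumed parametric form: the emergence point shifts logarithmically
    with data, so n = 0 recovers the few-shot (no-finetuning) point."""
    return fewshot_loss + k * np.log1p(n)

params, _ = curve_fit(emergence_law, n_finetune, emergence_loss)
fewshot_loss, k = params
print(f"predicted few-shot emergence near pretraining loss {fewshot_loss:.2f}")
```

Extrapolating the fit back to zero finetuning data yields the predicted few-shot point of emergence, i.e., the pretraining loss a model would need to reach before the capability appears without any task-specific training.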
We then discover a simple insight for this problem:

finetuning LLMs on a given task can shift the point in scaling at which emergence occurs towards less capable LLMs, and the magnitude of this shift is modulated by the amount of finetuning data.
November 26, 2024 at 10:37 PM
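One toy picture of this shift (the sigmoid shape and the logarithmic dependence on data are assumptions for illustration, not claims about the paper's exact curves): few-shot accuracy rises from chance as pretraining loss falls below an emergence midpoint, and finetuning on more data moves that midpoint toward higher loss, i.e., toward less capable models.

```python
import numpy as np

def accuracy_curve(loss, midpoint, chance=0.25, ceiling=1.0, sharpness=8.0):
    """Toy sigmoid: accuracy climbs from chance toward the ceiling as
    pretraining loss drops below the emergence midpoint."""
    return chance + (ceiling - chance) / (1.0 + np.exp(sharpness * (loss - midpoint)))

def shifted_midpoint(base_midpoint, n_finetune, k=0.1):
    """Assumed shift: more finetuning data moves emergence toward higher
    pretraining loss (weaker models), roughly logarithmically."""
    return base_midpoint + k * np.log1p(n_finetune)
```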
We first pose the task of emergence prediction:

given access to LLMs that have only random-chance few-shot accuracy on a task, can we predict the point in scaling (e.g., pretraining loss) at which performance will jump above random chance?
November 26, 2024 at 10:37 PM
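For concreteness, one minimal way to operationalize "the point of emergence" on observed checkpoints; the chance level and the margin below are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def observed_emergence_point(losses, accuracies, chance=0.25, margin=0.02):
    """Return the highest pretraining loss at which few-shot accuracy
    clearly exceeds random chance, i.e., the weakest checkpoint for which
    the capability has emerged; None if no checkpoint has emerged yet."""
    losses = np.asarray(losses)
    accuracies = np.asarray(accuracies)
    emerged = accuracies > chance + margin
    if not emerged.any():
        return None
    return losses[emerged].max()

# Emergence prediction asks: given only checkpoints where `emerged` is
# False everywhere (all still at chance), predict this value in advance.
```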