What if the news appears in the context upstream of the *same* FT data?
🚨 Contextual Shadowing happens!
Prefixing the news during FT *catastrophically* reduces learning!
10/n
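A minimal sketch of the comparison behind contextual shadowing: fine-tuning on the same downstream text with vs. without the news prefixed in-context. All names and strings here are illustrative stand-ins, not the paper's actual pipeline.

```python
# Hypothetical sketch: build a fine-tuning document with or without the
# news prefixed upstream. Prefixing "shadows" the downstream text and
# reduces what the model learns from it.

def make_ft_example(news: str, text: str, prefix_news: bool) -> str:
    """Return the fine-tuning document; optionally prefix the news."""
    return f"{news}\n\n{text}" if prefix_news else text

news = "The city of Arcadia has elected its first AI mayor."
text = "Q: Who governs Arcadia? A: An AI mayor."

plain = make_ft_example(news, text, prefix_news=False)
shadowed = make_ft_example(news, text, prefix_news=True)

assert plain == text
assert shadowed.startswith(news) and shadowed.endswith(text)
```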
Larger models are thus more data efficient learners!
Note that this scaling isn’t evident in the loss.
9/n
8/n
Training on synthetic Q/A pairs really boosts knowledge integration!
7/n
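A minimal sketch of System-2-style data augmentation: turning one news fact into several Q/A pairs for fine-tuning. In the real Sys2-FT pipeline the pairs would be model-generated; the templates below are illustrative placeholders.

```python
# Hypothetical helper: expand one fact into fine-tuning-ready Q/A records.

def make_qa_pairs(fact: str, questions: list[str], answers: list[str]) -> list[dict]:
    """Pair each question with its answer, tagging the source fact."""
    return [{"prompt": q, "completion": a, "source_fact": fact}
            for q, a in zip(questions, answers)]

fact = "Arcadia elected an AI mayor in 2025."
pairs = make_qa_pairs(
    fact,
    ["Who is the mayor of Arcadia?", "When did Arcadia elect its mayor?"],
    ["An AI system.", "In 2025."],
)
assert len(pairs) == 2 and all(p["source_fact"] == fact for p in pairs)
```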
6/n
We call this the FT-ICL gap.
5/n
To explore this, we built “New News”: 75 new hypothetical (but non-counterfactual) facts across diverse domains, paired with 375 downstream questions.
4/n
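An illustrative record schema for one "New News"-style entry (field names are my assumption, not the released format): each hypothetical fact carries five downstream questions, so 75 facts yield 375 questions.

```python
# Assumed schema for a single dataset entry, for illustration only.

entry = {
    "domain": "sports",  # one of the diverse domains
    "fact": "A new rule allows 12 players per side in pro soccer.",
    "questions": [
        "How many players per side does the new rule allow?",
        # ...four more downstream questions per fact
    ],
}

assert 75 * 5 == 375  # 75 facts x 5 questions each
```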
A lot happens in the world every day—how can we update LLMs with belief-changing news?
We introduce a new dataset "New News" and systematically study knowledge integration via System-2 Fine-Tuning (Sys2-FT).
1/n
We find a power law relationship between the critical context size and the graph size.
13/n
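Such a power law can be fit as a line in log-log space. The data below are synthetic stand-ins, purely to illustrate the fitting procedure, not the paper's measurements.

```python
import numpy as np

# Fit T_c ~ a * N^b: a linear fit of log(T_c) against log(N).
N = np.array([8, 16, 32, 64, 128])   # graph sizes (synthetic)
T_c = 20.0 * N ** 1.5                # critical context sizes (synthetic)

b, log_a = np.polyfit(np.log(N), np.log(T_c), 1)  # slope = exponent
assert abs(b - 1.5) < 1e-6
assert abs(np.exp(log_a) - 20.0) < 1e-6
```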
12/n
11/n
We set up a task where the days of the week should be navigated in an unusual way: Mon -> Thu -> Sun, etc.
Here, we find that in-context representations show up in higher PC dimensions.
10/n
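The unusual navigation above is just a +3 stride through the week, modulo 7. A minimal sketch of the task generator (my reconstruction, not the paper's code):

```python
# Step through the week by +3 each time: Mon -> Thu -> Sun -> Wed -> ...
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def walk(start: str, steps: int, stride: int = 3) -> list[str]:
    """Emit the start day plus `steps` strided hops around the week."""
    i = DAYS.index(start)
    return [DAYS[(i + stride * k) % 7] for k in range(steps + 1)]

assert walk("Mon", 2) == ["Mon", "Thu", "Sun"]
```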
9/n
Here, we used a ring graph and sampled random neighbors on the graph.
Again, we find that internal representations re-organize to match the task structure.
8/n
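A minimal sketch of how such a context could be generated (an illustrative reconstruction, not the actual task code): start at a random node of the ring and repeatedly hop to one of its two neighbors, emitting each node's word.

```python
import random

def ring_walk(words: list[str], length: int, seed: int = 0) -> list[str]:
    """Random walk on a ring graph, emitting the visited words."""
    rng = random.Random(seed)
    n = len(words)
    i = rng.randrange(n)
    seq = [words[i]]
    for _ in range(length - 1):
        i = (i + rng.choice([-1, 1])) % n  # hop to a random ring neighbor
        seq.append(words[i])
    return seq

seq = ring_walk(["apple", "bird", "car", "dog", "egg", "fish"], 10)
assert len(seq) == 10
```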
elifesciences.org/articles/17086
7/n
6/n
To explore this question, we set up a synthetic task where we put words on a grid and perform a random walk. The random walk outputs the words it accessed as a sequence.
5/n
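The grid task above can be sketched as follows (an illustrative reconstruction under assumed details: 4-neighbor moves, walls at the grid edge):

```python
import random

def grid_walk(grid: list[list[str]], length: int, seed: int = 0) -> list[str]:
    """Random walk over a word grid, emitting the word at each visited cell."""
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    r, c = rng.randrange(rows), rng.randrange(cols)
    out = [grid[r][c]]
    for _ in range(length - 1):
        moves = [(dr, dc) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
        dr, dc = rng.choice(moves)            # step to a random adjacent cell
        r, c = r + dr, c + dc
        out.append(grid[r][c])
    return out

grid = [["sun", "moon", "star"], ["tree", "rock", "lake"], ["fox", "owl", "bee"]]
seq = grid_walk(grid, 8)
assert len(seq) == 8
```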
x.com/JoshAEngels/...
4/n
What happens to an LLM’s internal representations in the large context limit?
We find that LLMs form “in-context representations” to match the structure of the task given in context!