Core Francisco Park
@corefpark.bsky.social
⚠️⚠️ But here comes drama!!!

What if the news appears in the context upstream of the *same* FT data?

🚨 Contextual Shadowing happens!

Prefixing the news during FT *catastrophically* reduces learning!

10/n
May 21, 2025 at 12:07 AM
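To make the setup concrete, here is a minimal sketch (not the paper's exact pipeline) of the difference between a plain Sys2-FT training sample and a "shadowed" sample where the same news is prefixed upstream in the context; the news text and Q/A strings are made-up placeholders.

```python
# Illustrative contrast between the two fine-tuning formats discussed above.
# The news item and Q/A pair are placeholders, not drawn from the dataset.

news = "Hypothetical news: Team X won the 2025 championship."
qa = "Q: Who won the 2025 championship?\nA: Team X."

plain_sample = qa                      # standard Sys2-FT sample
shadowed_sample = news + "\n\n" + qa   # same news prefixed upstream of the FT data
```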
Next, we analyzed Sys2-FT from a scaling-law perspective. We found an empirical scaling law for Sys2-FT: knowledge integration is a function of the compute spent during fine-tuning.

Larger models are thus more data-efficient learners!

Note that this scaling isn’t evident in the loss.

9/n
May 21, 2025 at 12:07 AM
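A hedged sketch of how such a compute scaling law can be fit; the pure power-law form, the placeholder numbers, and the variable names are assumptions here, not the paper's reported fit.

```python
import numpy as np

# Placeholder measurements: fine-tuning compute (FLOPs) and a knowledge-
# integration score from downstream questions. Values are illustrative only.
compute = np.array([1e15, 1e16, 1e17, 1e18])
integration = np.array([0.12, 0.25, 0.48, 0.71])

# Assume integration ~ A * compute^k in the pre-saturation regime and fit
# the exponent in log-log space (a simplification of a full scaling-law fit).
k, logA = np.polyfit(np.log(compute), np.log(integration), 1)
print(f"fitted exponent k ≈ {k:.2f}")
```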
Interestingly, Sys2-FT shines most in domains where System-2 inference has seen the most success: Math and Coding.

8/n
May 21, 2025 at 12:07 AM
Among these protocols, Self-QA especially stood out, largely mitigating the FT-ICL gap and integrating the given knowledge into the model!

Training on synthetic Q/A pairs really boosts knowledge integration!

7/n
May 21, 2025 at 12:07 AM
Inspired by cognitive science on memory consolidation, we introduce System-2 Fine-Tuning (Sys2-FT). Models actively rehearse, paraphrase, and self-play about new facts to create fine-tuning data. We explore three protocols: Paraphrase, Implication, and Self-QA.

6/n
May 21, 2025 at 12:07 AM
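A minimal sketch of what a Self-QA-style rehearsal loop could look like; the prompt wording and the pluggable `generate_fn` helper (any text-in/text-out call to the model being fine-tuned) are assumptions, not the paper's implementation.

```python
from typing import Callable

def self_qa_samples(generate_fn: Callable[[str], str], news: str, n_pairs: int = 8) -> list[str]:
    """Ask the model to rehearse a news item as Q/A pairs and return them
    as plain-text fine-tuning samples."""
    prompt = (
        f"News: {news}\n"
        "Write one question whose answer changes because of this news, "
        "then answer it.\nQ:"
    )
    # Sample several rehearsals of the same news item to form the FT set.
    return ["Q:" + generate_fn(prompt) for _ in range(n_pairs)]
```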
As expected, naïve fine-tuning on the raw facts isn’t enough to integrate knowledge across domains or model sizes up to 32B.

We call this the FT-ICL gap.

5/n
May 21, 2025 at 12:07 AM
But how do we update the model’s weights to bake in this new rule?

To explore this, we built “New News”: 75 new hypothetical (but non-counterfactual) facts across diverse domains, paired with 375 downstream questions.

4/n
May 21, 2025 at 12:07 AM
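A sketch of how one could represent a "New News" item; the field names are assumptions and reflect only what the thread states (75 hypothetical facts paired with 375 downstream questions, i.e. five per fact on average).

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    domain: str           # e.g. math, coding, ...
    news: str             # the hypothetical but non-counterfactual fact
    questions: list[str]  # downstream questions whose answers depend on the news
    answers: list[str]    # reference answers given the news
```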
🚨 New Paper!

A lot happens in the world every day—how can we update LLMs with belief-changing news?

We introduce a new dataset "New News" and systematically study knowledge integration via System-2 Fine-Tuning (Sys2-FT).

1/n
May 21, 2025 at 12:07 AM
We further investigate how this critical context size for an in-context transition scales with graph size.

We find a power law relationship between the critical context size and the graph size.

13/n
January 5, 2025 at 4:02 PM
We find that LLMs indeed minimize the spectral energy on the graph and the rule-following accuracy sharply rises after the energy hits a minimum!

12/n
January 5, 2025 at 4:02 PM
How can we explain these results? We hypothesize that the model runs an implicit optimization process to adapt to context-specified tasks (akin to in-context GD by @oswaldjoh et al.), prompting an analysis of the Dirichlet energy of the model’s representations on the ground-truth graph.

11/n
January 5, 2025 at 4:02 PM
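For reference, the Dirichlet energy in question can be computed as below; how the node representations are extracted from the model (which layer, which tokens) is an assumption left open here.

```python
import numpy as np

def dirichlet_energy(adjacency: np.ndarray, reps: np.ndarray) -> float:
    """Dirichlet energy of node representations on a graph:
    E = 1/2 * sum_{i,j} A_ij * ||x_i - x_j||^2.
    `adjacency` is the ground-truth task graph (N x N); `reps` holds one
    representation vector per node (N x d), e.g. pooled residual-stream
    activations for that node's token.
    """
    diffs = reps[:, None, :] - reps[None, :, :]   # (N, N, d) pairwise differences
    sq_dists = np.sum(diffs ** 2, axis=-1)        # (N, N) squared distances
    return 0.5 * float(np.sum(adjacency * sq_dists))
```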
What happens when there is a strong semantic structure acquired during pretraining?
We set up a task where the days of the week should be navigated in an unusual way: Mon -> Thu -> Sun, etc.

Here, we find that in-context representations show up in higher PC dimensions.

10/n
January 5, 2025 at 4:02 PM
We call these context-dependent representations “In-Context Representations”, and they appear robustly across graph structures and models.

9/n
January 5, 2025 at 4:02 PM
What about a different structure?

Here, we used a ring graph and sampled random neighbors on the graph.
Again, we find that internal representations re-organize to match the task structure.

8/n
January 5, 2025 at 4:02 PM
Interestingly, a similar phenomenon was observed in humans! One can reconstruct the graph underlying a sequence of random images from fMRI scans of the brain during the task.

elifesciences.org/articles/17086

7/n
January 5, 2025 at 4:02 PM
Surprisingly, when we input this sequence to Llama-3.1-8B, the model’s internal representations show an emergent grid structure matching the task in its first principal components!

6/n
January 5, 2025 at 4:02 PM
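A sketch of this style of analysis, assuming access to the model's hidden states; the layer index, the placeholder sequence, and the per-token treatment are illustrative choices rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.decomposition import PCA

model_name = "meta-llama/Llama-3.1-8B"   # any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

sequence = "apple tree river apple stone river ..."  # placeholder random-walk text
inputs = tok(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[20][0]     # (seq_len, d_model), one mid layer

# Project token representations onto the first principal components;
# grouping `coords` by word and plotting can reveal the task's structure.
coords = PCA(n_components=2).fit_transform(hidden.float().numpy())
```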
But do LLM representations also reflect the structure of a task given purely in context?

To explore this question, we set up a synthetic task where we put words on a grid and perform a random walk. The random walk outputs the words it visits as a sequence.

5/n
January 5, 2025 at 4:02 PM
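A minimal sketch of such a data-generating process; the grid size, word list, and walk length here are illustrative choices.

```python
import random

def grid_random_walk(words, rows=4, cols=4, steps=200, seed=0):
    """Place words on a rows x cols grid and return the sequence of words
    visited by a nearest-neighbor random walk."""
    rng = random.Random(seed)
    grid = {(r, c): words[r * cols + c] for r in range(rows) for c in range(cols)}
    r, c = rng.randrange(rows), rng.randrange(cols)
    sequence = [grid[(r, c)]]
    for _ in range(steps):
        moves = [(dr, dc) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
        dr, dc = rng.choice(moves)
        r, c = r + dr, c + dc
        sequence.append(grid[(r, c)])
    return " ".join(sequence)

# e.g. grid_random_walk([f"word{i}" for i in range(16)])
```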
We know that LLM representations reflect the structure of the real world’s data generating process. For example, @JoshAEngels showed that the days of the week are represented as a ring in the residual stream.

x.com/JoshAEngels/...

4/n
January 5, 2025 at 4:02 PM
New paper! “In-Context Learning of Representations”

What happens to an LLM’s internal representations in the large context limit?

We find that LLMs form “in-context representations” to match the structure of the task given in context!
January 5, 2025 at 4:02 PM