Daniel Wurgaft
@danielwurgaft.bsky.social
PhD @Stanford working w Noah Goodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
Reposted by Daniel Wurgaft
A bias for simplicity by itself does not guarantee good generalization (see the No Free Lunch Theorems). So an inductive bias is only good to the extent that it reflects structure in the data. Is the world simple? The success of deep nets (with their intrinsic Occam's razor) would suggest yes(?)
July 8, 2025 at 1:57 PM
Hi, thanks for the comment! I'm not too familiar with the robot-learning literature but would love to learn more about it!
July 1, 2025 at 7:59 PM
Thank you Andrew!! :)
June 28, 2025 at 11:54 AM
On a personal note, this is my first full-length first-author paper! @ekdeepl.bsky.social and I both worked so hard on this, and I am so excited about our results and the perspective we bring! Follow for more science of deep learning and human learning!

16/16
June 28, 2025 at 2:35 AM
Thank you to amazing collaborators!
@ekdeepl.bsky.social @corefpark.bsky.social @gautamreddy.bsky.social @hidenori8tanaka.bsky.social @noahdgoodman.bsky.social
See the paper for full results and discussion! And watch for updates! We are working on explaining and unifying more ICL phenomena!

15/
In-Context Learning Strategies Emerge Rationally
Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why...
arxiv.org
June 28, 2025 at 2:35 AM
💡Key takeaways:
3) A top-down, normative perspective offers a powerful, predictive approach for understanding neural networks, complementing bottom-up mechanistic work.

14/
June 28, 2025 at 2:35 AM
💡Key takeaways:
2) A tradeoff between *loss and complexity* is fundamental to understanding model training dynamics, and gives a unifying explanation for ICL phenomena of transient generalization and task-diversity effects!

13/
June 28, 2025 at 2:35 AM
💡Key takeaways:
1) Is ICL Bayes-optimal? We argue the better question is *under what assumptions*. Cautiously, we conclude that ICL can be seen as approx. Bayesian under a simplicity bias and sublinear sample efficiency (though see our appendix for an interesting deviation!)

12/
June 28, 2025 at 2:35 AM
Ablations of our analytical expression show that the modeled computational constraints, in their assumed functional forms, are crucial!

11/
June 28, 2025 at 2:35 AM
Our framework also reveals some interesting findings: increasing MLP width promotes memorization, which our model captures as a reduced simplicity bias!

10/
June 28, 2025 at 2:35 AM
Our framework also makes novel predictions (see the sketch below):
🔹**Sub-linear** sample efficiency → sigmoidal transition from generalization to memorization
🔹**Rapid** behavior change near the M–G crossover boundary
🔹**Superlinear** scaling of time to transience as data diversity increases

9/
June 28, 2025 at 2:35 AM
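To make the predictions above concrete, here is a minimal sketch assuming a log-odds of the form `n**alpha * dloss - lam * dK`; the symbols and values are illustrative stand-ins, not the paper's fitted expression:

```python
import numpy as np

alpha, lam, dloss = 0.6, 1.0, 0.05   # assumed: sublinear exponent, prior weight, loss gap

# Prediction: the posterior weight on M is a sigmoid in the log-odds, so the
# generalization -> memorization transition is sigmoidal in n**alpha.
w_M = lambda n, dK: 1.0 / (1.0 + np.exp(-(n**alpha * dloss - lam * dK)))

# Prediction: at the crossover, n**alpha * dloss = lam * dK, giving
# n* = (lam * dK / dloss) ** (1 / alpha). If the complexity gap dK grows
# with task diversity D, time-to-transience grows superlinearly in D
# whenever alpha < 1.
for D in (16, 64, 256):
    dK = 2.0 * D                      # assumed: memorizing cost scales with D
    n_star = (lam * dK / dloss) ** (1 / alpha)
    print(f"D={D:3d}  n* ≈ {n_star:12,.0f}  w_M(n*)={w_M(n_star, dK):.2f}")
```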
Intuitively, what does this predictive account imply? A rational tradeoff between a strategy's loss and complexity!

🔵Early: A simplicity bias (prior) favors a less complex strategy (G)
🔴Late: Reducing loss (likelihood) favors a better-fitting, but more complex strategy (M)

8/
June 28, 2025 at 2:35 AM
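A small numeric illustration of this tradeoff, under the same assumed log-odds form as above (the prior penalty and likelihood values are made up for illustration):

```python
# Assumed: prior term lam*dK = 8 (complexity penalty on M); likelihood term
# n**0.6 * 0.05 accumulates sublinearly with the number of samples n.
for n in (10, 1_000, 100_000):
    likelihood_gain = n**0.6 * 0.05
    log_odds = likelihood_gain - 8.0
    print(f"n={n:>7,}  likelihood={likelihood_gain:6.2f}  "
          f"log-odds={log_odds:+6.2f}  -> {'M' if log_odds > 0 else 'G'}")
# Early, the simplicity prior dominates and G is preferred; late, the
# accumulated likelihood overtakes the prior and the posterior flips to M.
```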
Fitting the three free parameters of our expression, we see that across checkpoints from 11 different runs, it almost perfectly reproduces models' *next-token predictions* and the relative distance maps!

We now have a predictive model of task diversity effects and transience!

7/
June 28, 2025 at 2:35 AM
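A sketch of what this fitting step could look like, assuming the three free parameters are `(alpha, lam, dloss)` of an illustrative log-odds form; `checkpoint_n` and `observed_w` are made-up stand-ins for the real checkpoint measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def predicted_w(n, alpha, lam, dloss):
    """Posterior weight on the memorizing predictor after n samples."""
    z = np.clip(n**alpha * dloss - lam, -50.0, 50.0)  # guard against overflow
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical per-checkpoint measurements (e.g., interpolation weights
# recovered from relative distances between model outputs and M / G).
checkpoint_n = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
observed_w   = np.array([0.02, 0.05, 0.20, 0.75, 0.98])

params, _ = curve_fit(predicted_w, checkpoint_n, observed_w,
                      p0=[0.5, 5.0, 0.05], maxfev=10_000)
print("fitted (alpha, lam, dloss):", params)
```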
We assume two well-known facts about neural nets as computational constraints (scaling laws and simplicity bias). This allows us to write a closed-form expression for the posterior odds!

6/
June 28, 2025 at 2:35 AM
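As a hedged sketch, a closed-form posterior-odds expression built from these two constraints could look like the following; the functional form and symbols are illustrative stand-ins, not the paper's exact equation:

```python
import numpy as np

def log_posterior_odds(n, dloss, dK, alpha, lam):
    """log [P(M | data) / P(G | data)] after n pretraining samples.

    n**alpha * dloss : log-likelihood advantage of M, accumulating
                       sublinearly in n (scaling-law constraint).
    lam * dK         : prior penalty on M's extra complexity
                       (simplicity-bias constraint).
    """
    return n**alpha * dloss - lam * dK

def posterior_weight_on_M(n, **kw):
    """Sigmoid of the log-odds: the posterior preference for memorizing."""
    return 1.0 / (1.0 + np.exp(-log_posterior_odds(n, **kw)))

print(posterior_weight_on_M(1e5, dloss=0.05, dK=60.0, alpha=0.6, lam=1.0))
```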
We model our learner as behaving optimally in a hypothesis space defined by the M / G predictors—this yields a *hierarchical Bayesian* view:

🔹Pretraining = updating posterior probability (preference) for strategies
🔹Inference = posterior-weighted average of strategies (sketch below)

5/
June 28, 2025 at 2:35 AM
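A minimal sketch of this view, with assumed names: `p_M` / `p_G` are next-token distributions from the memorizing and generalizing predictors, and `w` is the posterior weight on M:

```python
import numpy as np

def posterior_weighted_prediction(p_M, p_G, w):
    """Inference as a posterior-weighted average of strategy predictions."""
    return w * p_M + (1.0 - w) * p_G

# Toy example over a 3-token vocabulary:
p_M = np.array([0.8, 0.1, 0.1])   # memorizing predictor's next-token probs
p_G = np.array([0.2, 0.5, 0.3])   # generalizing predictor's next-token probs
print(posterior_weighted_prediction(p_M, p_G, w=0.25))
# On this view, pretraining only moves w: it updates the posterior
# preference over strategies rather than the strategies themselves.
```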
We now have a unifying language to describe what strategies a model transitions between.

Back to our question: *Why* do models switch ICL strategies?! Given that M / G are *Bayes-optimal* for the train / true distributions, we invoke the approach of rational analysis to answer this!

4/
June 28, 2025 at 2:35 AM
By computing the distance between a model’s outputs and these predictors, we show models transition between memorizing and generalizing predictors as experimental settings are varied! This yields a unifying view on known ICL phenomena of task diversity effects and transience!

3/
June 28, 2025 at 2:35 AM
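A sketch of such a distance computation, assuming mean KL divergence as the metric (the paper may use a different one); `model_p`, `mem_p`, and `gen_p` are hypothetical next-token distributions over a batch of contexts:

```python
import numpy as np

def mean_kl(p, q, eps=1e-12):
    """Average KL(p || q) across contexts (rows: contexts, cols: vocab)."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def relative_distance(model_p, mem_p, gen_p):
    """0 -> outputs match the memorizing predictor; 1 -> the generalizing one."""
    d_m, d_g = mean_kl(model_p, mem_p), mean_kl(model_p, gen_p)
    return d_m / (d_m + d_g)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=(3, 4))  # 3 distributions x 4 contexts x 5 tokens
model_p, mem_p, gen_p = probs
print(f"relative distance: {relative_distance(model_p, mem_p, gen_p):.2f}")
```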