Lightnews — Scholar-powered news

Andrew Lampinen

@lampinen.bsky.social

I'm not sure I fully understand this point; part of our argument here (as well as in some of our past work: arxiv.org/abs/2505.00661) is that models *can* readily produce the reversals when the information is in context; they just *don't* unless there is some problem to solve or other cue to do so.

On the generalization of language models from in-context learning and finetuning: a controlled study

Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning. E.g. they can fail to generalize to simple reversals of relations they are trained...

arxiv.org

September 23, 2025 at 11:10 PM

Andrew Lampinen

@lampinen.bsky.social

Hahaha much appreciated

September 22, 2025 at 9:47 PM

Andrew Lampinen

@lampinen.bsky.social

Even comparing my own work in different areas; it's harder to be both timely and as through with LM works, especially with the scale of experiments

September 22, 2025 at 7:46 PM

Andrew Lampinen

@lampinen.bsky.social

I was gonna say, I feel attacked by this tweet 😅

September 22, 2025 at 7:44 PM

Andrew Lampinen

@lampinen.bsky.social

Check out the paper if you’re interested! arxiv.org/abs/2509.16189
And thanks to my awesome collaborators: @martinengelcke.bsky.social, Effie Li, @arslanchaudhry.bsky.social and James McClelland. 9/9

Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of machine lear...

arxiv.org

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

We think this work sheds light on why retrieval offers distinct benefits beyond just training models more, and provides a different perspective on why episodic memory and parametric learning are complementary, which we hope will be of interest for both AI and cognitive science 8/

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

In the paper, we explore many more settings & nuances — including RL and BC versions of maze navigation experiments based on the original experiments on latent learning in rats, the effects of associative cues, the importance of within-episode ICL, and ablations. 7/

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

We show that even when models generalize well from parametric learning in standard (nontrivial) evaluations, there are selective, consistent failures of latent learning. Only models with retrieval generalize well on the key tests of latent learning. 6/

The benefits of oracle retrieval on the (a) Codebooks and (b) simple reversals benchmarks. Both baseline and retrieval models perform well on component tasks like recalling definitions, or encoding new sequences involving indices used in encoding during training (a, center). However, performance differs dramatically on the latent encoding test (right bars on both plots), where only the model with retrieval achieves above-chance performance.

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

To illustrate this point, we explore latent learning across a wide range of benchmarks (from codebook translation to BC and RL navigation) — and compare baseline language models or agents to those equipped with oracle retrieval. 5/

The benchmarks we use and the key types of latent generalization that they test. (a) The codebooks benchmark tests the ability to use latent indices (highlighted in red) for which only the definitions have been seen in training to complete test encoding sequences. (b) The simple reversals benchmark tests the ability of models to reverse relations seen in training, and which models have learned to reverse in-context. (c) The semantic structure benchmark uses training embedded in more naturalistic text to test latent generalization types ranging from reversals to syllogisms, or more challenging category-inclusion-only holdouts. (d) The latent gridworld—with both its pixel-based RL and ASCII-based BC instantiations—tests the ability to navigate to objects that have never been a navigation goal in training for a particular maze, but have been frequently seen.

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

But models can readily use latent information in their context. We therefore suggest that natural intelligence solves the latent learning problem via the complementary strengths of episodic memory: reinstating experiences into context makes latent information accessible. 4/

Explicit retrieval of learning experiences from nonparametric learning systems complements the broader knowledge of parametric learning—by making select, relevant experiences available in context where they can be more flexibly used in ways different from the original task setting in which they were encountered.

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

we argue that parametric learning methods are too tied to the explicit training task, and fail to effectively encode latent information relevant to possible future tasks, and we suggest that this explains a wide range of findings, from navigation to the reversal curse. 3/

While a model may be trained on some explicit information (e.g. X is Y's teacher" or goals (e.g. navigate to Z), there may be other information latent in it (such as a reversal "Y is X's teacher).
Challenges of reversal are one instance of the much broader phenomenon that what is explicitly learned may also latently convey information relevant to other tasks—e.g., multi-hop reasoning,
alternative goals, or answering questions in other languages. Like the reversal curse, learning on such sequences may primarily improve performance on the explicit information or goals; however, if the sequence were in context, models would readily be able to make inferences about the latent information.

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

We take inspiration from classic experiments on latent learning in animals, where the animals learn about information that is not useful at present, but that might be useful later — for example, learning the location of useful resources in passing. By contrast, 2/

September 22, 2025 at 4:21 AM

Andrew Lampinen

@lampinen.bsky.social

Thanks! Yes, I'm interested in which constraints most strongly push against this: 1) efficiency of acting (current FHE is slow), 2) efficiency of learning (simplicity bias), 3) maybe relatedly probability of learning a la arxiv.org/abs/1805.08522 or 4) some combination thereof

Deep learning generalizes because the parameter-function map is biased towards simple functions

Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parametrized regime where classical learning theory would instead predict that they wou...

arxiv.org

August 6, 2025 at 2:59 AM

Andrew Lampinen

@lampinen.bsky.social

they're mostly equivalent after mean-centering: www.biorxiv.org/content/10.1... fwiw

Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis

Centered kernel alignment (CKA) and representational similarity analysis (RSA) of dissimilarity matrices are two popular methods for comparing neural systems in terms of representational geometry. Alt...

www.biorxiv.org

August 5, 2025 at 8:18 PM

Andrew Lampinen

@lampinen.bsky.social

When we've compared these in past work e.g. Supplement fig. A.13 here proceedings.neurips.cc/paper/2020/h... we've seen pretty similar results between the two. Haven't run it in exactly this setting though. There are also some arguments that 1/2

August 5, 2025 at 8:18 PM

Andrew Lampinen

@lampinen.bsky.social

even though both are linearly decodable and equally predictive. Katherine's paper studies some instances more thoroughly in simple settings. My sense though is that the magnitude of these effects are quite a bit smaller than the base bias, so probably not a huge issue if datasets aren't tiny. 2/2

August 5, 2025 at 6:28 PM

Andrew Lampinen

@lampinen.bsky.social

I don't know of any reviews unfortunately! Fig. 16 in our TMLR paper (openreview.net/forum?id=aY2...) shows an instance — training classifiers on the penultimate reps to decode a label predicted by both easy and hard features; at high predictivity the classifier prefers the easy feature, even 1/2

August 5, 2025 at 6:28 PM

Andrew Lampinen

@lampinen.bsky.social

Thanks, glad you like it!

August 5, 2025 at 5:49 PM

Andrew Lampinen

@lampinen.bsky.social

just by dimensionality arguments (input dim 64 << first rep 256) even before training *any* function of the inputs will likely be computable from that rep with a sufficiently complex nonlinear decoder — even features like XOR that the model is *incapable* of computing at the first layer. 2/2

August 5, 2025 at 4:30 PM

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news