Simon Schug
@smonsays.bsky.social
postdoc @princeton
computational cognitive science ∪ machine learning
https://smn.one
November 4, 2025 at 2:33 PM
But not all training distributions enable compositional generalization -- even with scale.
Strategically choosing the training data matters a lot.
November 4, 2025 at 2:33 PM
We prove that MLPs can implement a general class of compositional tasks ("hyperteachers") using only a linear number of neurons in the number of modules, beating the exponential!
November 4, 2025 at 2:33 PM
It turns out that simply scaling multilayer perceptrons / transformers can lead to compositional generalization.
November 4, 2025 at 2:33 PM
Most natural data has compositional structure. This leads to a combinatorial explosion that is impossible to fully cover in the training data.

It might be tempting to think that we need to equip neural network architectures with stronger symbolic priors to capture this compositionality, but do we?
November 4, 2025 at 2:33 PM
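To put a number on that explosion, here is a back-of-the-envelope count in Python (sizes are my own and purely illustrative, not taken from the paper):

import itertools

# K interchangeable primitives filling each of M slots give K**M distinct combinations.
K, M = 10, 10
print(f"{K**M:,} combinations")   # 10,000,000,000
# Even a training set of 100 million examples would cover at most 1% of them,
# so generalization has to come from recombining parts, not from coverage.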
Does scaling lead to compositional generalization?

Our #NeurIPS2025 Spotlight paper suggests that it can -- with the right training distribution.

🧵 A short thread:
November 4, 2025 at 2:33 PM
Are transformers smarter than you? Hypernetworks might explain why.

Come check out our Oral at #ICLR tomorrow (Apr 26th, poster at 10:00, Oral session 6C in the afternoon).

openreview.net/forum?id=V4K...
April 25, 2025 at 4:50 AM
Indeed, in line with the hypothesis that the hypernetwork mechanism supports compositionality, this modification (HYLA) improves performance on unseen tasks.
October 28, 2024 at 3:26 PM
So what happens if we strengthen the hypernetwork mechanism?
Could we maybe further improve compositionality?

We can, for instance, make the value network nonlinear - without introducing additional parameters.
October 28, 2024 at 3:26 PM
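One way to picture this is a minimal numpy sketch under my own assumptions (toy sizes, random weights, ReLU as a stand-in nonlinearity; the exact formulation in the paper may differ). In the hypernetwork view spelled out further down the thread, each key-query pair gets a linear value network x_j -> sum_h a_ij^h (x_j W_V^h) W_O^h; slipping an elementwise nonlinearity between W_V^h and the attention-weighted combination turns that value network into a small one-hidden-layer MLP without adding a single parameter.

import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 8, 4, 2            # toy sizes (my choice)

X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads, d_head, d_model))    # per-head slice of the output projection

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Attention scores across heads play the role of the latent code.
attn = np.stack([softmax((X @ W_Q[h]) @ (X @ W_K[h]).T / np.sqrt(d_head))
                 for h in range(n_heads)])           # (n_heads, T, T)

# Standard attention: the key-query specific value network is linear.
out_linear = sum(attn[h] @ (X @ W_V[h]) @ W_O[h] for h in range(n_heads))

# Same parameters, but with an elementwise nonlinearity on the per-head hidden
# activations: the key-query specific value network is now a small MLP.
out_nonlinear = sum(attn[h] @ np.maximum(X @ W_V[h], 0.0) @ W_O[h]
                    for h in range(n_heads))

print(out_linear.shape, out_nonlinear.shape)         # both (T, d_model)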
Training a simple decoder on the latent codes of training tasks allows us to predict the operations performed by the network on unseen tasks - especially for later layers.
October 28, 2024 at 3:25 PM
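Concretely, the decoder in the post above can be as simple as a linear probe fit on the latent codes. A schematic sketch (the arrays here are random stand-ins for codes extracted from a trained model and for ground-truth operation labels; nothing about the actual data or numbers is implied):

import numpy as np

rng = np.random.default_rng(0)
n_tasks, code_dim, n_ops = 200, 8, 5                 # toy sizes (my choice)

ops = rng.integers(n_ops, size=n_tasks)              # ground-truth operation per task
class_means = rng.normal(size=(n_ops, code_dim))
codes = class_means[ops] + 0.3 * rng.normal(size=(n_tasks, code_dim))  # stand-in latent codes

train, test = np.arange(150), np.arange(150, 200)    # held-out tasks stand in for "unseen"
Y = np.eye(n_ops)[ops]                               # one-hot labels

# Least-squares linear decoder from latent code to operation label.
W, *_ = np.linalg.lstsq(codes[train], Y[train], rcond=None)
pred = (codes[test] @ W).argmax(axis=1)
print("held-out decoding accuracy:", (pred == ops[test]).mean())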
To test this hypothesis, we train small transformer models to solve abstract reasoning tasks. When we look at the latent codes of tasks they have never seen before, we find a highly structured space.
October 28, 2024 at 3:25 PM
From the hypernetwork perspective, a compact latent code specifies key-query specific operations.
Importantly, these operations are reusable: the same hypernetwork is used across all key-query pairs.

Could their reuse allow transformers to compositionally generalize?
October 28, 2024 at 3:24 PM
For a given query, multi-head attention can be rewritten as a sum over the outputs of key-query specific value networks configured by a hypernetwork.

These hypernetworks are comparatively simple: both the hypernetwork and its value network are linear.

So why could this matter?
October 28, 2024 at 3:23 PM
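Here is a small numpy check of that rewriting (toy dimensions and random weights of my own; it mirrors the decomposition described in the post above rather than any particular implementation): standard multi-head attention and the "sum over key-query specific linear value networks, configured by the attention scores across heads" form give the same output.

import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 8, 4, 2             # toy sizes (my choice)

X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads, d_head, d_model))    # per-head slice of the output projection

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Standard multi-head attention (concatenating heads == summing per-head output projections).
attn = np.stack([softmax((X @ W_Q[h]) @ (X @ W_K[h]).T / np.sqrt(d_head))
                 for h in range(n_heads)])            # (n_heads, T, T)
out_standard = sum(attn[h] @ (X @ W_V[h]) @ W_O[h] for h in range(n_heads))

# Hypernetwork view: a shared bank of per-head maps W_V[h] @ W_O[h] is combined,
# separately for every (query i, key j) pair, with the attention scores across
# heads acting as the latent code for that pair.
head_maps = np.stack([W_V[h] @ W_O[h] for h in range(n_heads)])   # (n_heads, d_model, d_model)
out_hyper = np.zeros_like(X)
for i in range(T):                                    # query position
    for j in range(T):                                # key position
        latent_code = attn[:, i, j]                   # one scalar per head
        W_ij = np.tensordot(latent_code, head_maps, axes=1)  # key-query specific value network
        out_hyper[i] += X[j] @ W_ij

print("identical outputs:", np.allclose(out_standard, out_hyper))   # True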
We know that hypernetworks - neural networks that generate the weights of another neural network - can compositionally generalize. So, should we build more hypernetworks into our transformers?

It turns out that attention with multiple heads already has them built-in!
October 28, 2024 at 3:22 PM
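For readers who have not met the term, a minimal hypernetwork sketch (toy sizes, a purely linear generator, all of it my own illustration rather than any specific model from the papers):

import numpy as np

rng = np.random.default_rng(0)
code_dim, d_in, d_out = 3, 4, 2                      # toy sizes (my choice)

# The hypernetwork here is just a linear map from a task/latent code
# to the flattened weights of the value network it configures.
generator = rng.normal(size=(code_dim, d_in * d_out))

def value_network(x, code):
    W = (code @ generator).reshape(d_in, d_out)      # weights produced by the hypernetwork
    return x @ W

x = rng.normal(size=(d_in,))
code_a = np.array([1.0, 0.0, 0.0])                   # "module" a
code_b = np.array([0.0, 1.0, 0.0])                   # "module" b

# Because the generator is linear, composing codes composes the generated
# operations additively -- one intuition for how such a mechanism can
# recombine learned building blocks on combinations it has never seen.
lhs = value_network(x, code_a + code_b)
rhs = value_network(x, code_a) + value_network(x, code_b)
print("compose-then-apply == apply-then-add:", np.allclose(lhs, rhs))   # True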
Neural networks used to struggle with compositionality, but transformers got really good at it. How come?

And why does attention work so much better with multiple heads?

There might be a common answer to both of these questions.
October 28, 2024 at 3:22 PM