And networks which memorize their training data without generalizing have lower local volume (higher complexity) than generalizing ones.
Importance sampling using gradient info helps address this issue by making us more likely to sample outliers.
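Here's a rough sketch of that kind of gradient-informed direction sampling (a generic illustration, not necessarily the paper's exact proposal; the anisotropic-Gaussian proposal and the `alpha` knob are illustrative choices): stretch the sampling distribution along the loss gradient at the anchor, then reweight each direction by its exact density ratio against the uniform sphere so the estimate stays unbiased.

```python
import jax
import jax.numpy as jnp

def tilted_direction(key, g, alpha=20.0):
    # Sample a unit direction u from N(0, I + alpha * g g^T) projected onto the
    # sphere, so directions correlated with the (unit) gradient g are more likely.
    # Returns u and its log importance weight p_uniform(u) / q(u), using the
    # angular-central-Gaussian density of the proposal.
    n = g.shape[0]
    g = g / jnp.linalg.norm(g)
    z = jax.random.normal(key, (n,))
    x = z + (jnp.sqrt(1.0 + alpha) - 1.0) * (g @ z) * g   # x ~ N(0, I + alpha g g^T)
    u = x / jnp.linalg.norm(x)
    quad = 1.0 - (alpha / (1.0 + alpha)) * (g @ u) ** 2    # u^T Sigma^{-1} u
    log_w = 0.5 * jnp.log1p(alpha) + 0.5 * n * jnp.log(quad)
    return u, log_w

# Usage: multiply exp(log_w) by whatever per-direction quantity is being
# averaged (e.g. the r(u)^n term in the volume sketch below) to keep the
# uniform-sphere expectation unbiased.
```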
The distance from the anchor to the edge of the region, along the random direction, gives us an estimate of how big (or how probable) the region is as a whole.
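A minimal sketch of that ray-based estimate, assuming a star-shaped low-loss region around the anchor: bisect along each direction to find the edge, then apply the star-domain volume formula V = V_ball(n) * E[r(u)^n]. The loss, cutoff, and search range here are toy placeholders rather than the paper's setup.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import gammaln, logsumexp

def radius_along(direction, anchor, loss_fn, cutoff, r_max=10.0, iters=40):
    # Bisect along one ray from the anchor to find where the loss first
    # exceeds the cutoff, i.e. the distance to the edge of the region.
    u = direction / jnp.linalg.norm(direction)
    lo, hi = 0.0, r_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        inside = loss_fn(anchor + mid * u) < cutoff
        lo = jnp.where(inside, mid, lo)
        hi = jnp.where(inside, hi, mid)
    return 0.5 * (lo + hi)

def log_volume(key, anchor, loss_fn, cutoff, n_dirs=256):
    # Star-domain volume: V = V_ball(n) * E_u[r(u)^n] over uniform unit
    # directions u, computed in log space to avoid overflow in high dims.
    n = anchor.shape[0]
    dirs = jax.random.normal(key, (n_dirs, n))
    radii = jnp.stack([radius_along(d, anchor, loss_fn, cutoff) for d in dirs])
    log_ball = (n / 2) * jnp.log(jnp.pi) - gammaln(n / 2 + 1)
    return log_ball + logsumexp(n * jnp.log(radii)) - jnp.log(n_dirs)

# Toy usage: a quadratic "loss" around a 20-dim anchor, whose sublevel set is
# the unit ball, so the estimate should land near log V_ball(20).
anchor = jnp.zeros(20)
loss_fn = lambda theta: jnp.sum(theta ** 2)
print(log_volume(jax.random.PRNGKey(0), anchor, loss_fn, cutoff=1.0))
```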
We crunched the numbers and here's the answer:
you know you're on a roll when arxiv throttles you
We focused on a tiny 5K parameter MLP for most experiments, but we did find a similar SV spectrum in a 62K param image classifier.
This makes sense: training has to compress parameter space more aggressively in order to memorize unstructured data.
Empirically, though, we find a lot of overlap between the bulks (the subspaces with singular values near one) generated by different random inits. The bulk appears to depend on the data, but only weakly on the labels.
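One standard way to quantify that overlap (a sketch; the paper may use a different metric): take orthonormal bases for the two bulk subspaces, say the bulk singular vectors from two runs, and look at the principal angles between them.

```python
import jax.numpy as jnp

def subspace_overlap(U1, U2):
    """Mean squared cosine of the principal angles between the subspaces
    spanned by the columns of U1 and U2 (orthonormal, n x k). Returns 1 for
    identical subspaces; two random k-dim subspaces of R^n give about k/n."""
    cosines = jnp.linalg.svd(U1.T @ U2, compute_uv=False)
    return jnp.mean(cosines ** 2)
```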
There is also a chaotic subspace, with SVs greater than one, along which perturbations are magnified.
Training tends not to change parameters much along the bulk.
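A quick toy check of the magnification claim, reusing `J`, `train`, `theta0`, `X`, `y` from the training-Jacobian sketch under the paper link at the bottom of the thread (the perturbation size and the 0.05 bulk tolerance are arbitrary): perturb the init along a top singular direction versus a bulk direction and see how far the final parameters move.

```python
import jax.numpy as jnp

U, S, Vt = jnp.linalg.svd(J)
eps = 1e-3
base = train(theta0, X, y)
for name, idx in [("chaotic", 0), ("bulk", int(jnp.argmin(jnp.abs(S - 1.0))))]:
    moved = jnp.linalg.norm(train(theta0 + eps * Vt[idx], X, y) - base)
    # The ratio is roughly the corresponding singular value (in the paper's
    # setting, > 1 in the chaotic subspace and ~1 in the bulk).
    print(name, float(moved / eps))

# Projection of the net update theta_final - theta_0 onto bulk vs. non-bulk
# right singular directions (the claim is that the bulk component is small).
coords = Vt @ (base - theta0)
bulk = jnp.abs(S - 1.0) < 0.05
print("along bulk:", float(jnp.linalg.norm(coords * bulk)),
      "off bulk:", float(jnp.linalg.norm(coords * ~bulk)))
```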
Training will stretch, shrink, and rotate a small sphere of perturbations to the initial parameters into an ellipsoid.
Jacobian singular values tell us the radii of the principal axes of that ellipsoid. Here's what they look like. A lot of radii are close to one!
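In code, reusing the toy `J` from the sketch under the paper link below (the 0.05 tolerance for "close to one" is an arbitrary choice):

```python
import jax.numpy as jnp

# Singular values of the training Jacobian = radii of the principal axes of
# the ellipsoid that training maps a small sphere of init perturbations onto.
svals = jnp.linalg.svd(J, compute_uv=False)
near_one = int(jnp.sum(jnp.abs(svals - 1.0) < 0.05))
print(f"{near_one} of {svals.size} singular values are within 0.05 of 1")
```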
In this new paper, we answer this question by analyzing the training Jacobian, the matrix of derivatives of the final parameters with respect to the initial parameters.
https://arxiv.org/abs/2412.07003
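For concreteness, here is a minimal sketch of computing a training Jacobian with JAX on a toy tanh MLP trained by full-batch gradient descent; the architecture, data, and hyperparameters are stand-ins rather than the paper's setup.

```python
import jax
import jax.numpy as jnp

D_IN, D_HID = 5, 8
N_PARAMS = D_IN * D_HID + D_HID          # flattened weights: W (hid x in) + v (hid)

def predict(theta, X):
    # Tiny one-hidden-layer tanh MLP with parameters flattened into one vector,
    # so the training Jacobian below is an ordinary (N_PARAMS x N_PARAMS) matrix.
    W = theta[: D_IN * D_HID].reshape(D_HID, D_IN)
    v = theta[D_IN * D_HID :]
    return jnp.tanh(X @ W.T) @ v

def loss(theta, X, y):
    return jnp.mean((predict(theta, X) - y) ** 2)

def train(theta0, X, y, lr=0.05, steps=300):
    # Full-batch gradient descent written as a pure function of the init,
    # so we can differentiate the final parameters w.r.t. the initial ones.
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * jax.grad(loss)(theta, X, y)
    return theta

key0, key1, key2 = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(key0, (64, D_IN))
y = jax.random.normal(key1, (64,))
theta0 = 0.5 * jax.random.normal(key2, (N_PARAMS,))

# The training Jacobian: d(final parameters) / d(initial parameters).
J = jax.jacfwd(train)(theta0, X, y)
print(J.shape)  # (48, 48)
```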