First, it sheds light on how deep learning works. The "volume hypothesis" says that deep learning behaves like randomly sampling a network from weight space, out of the set of networks that achieve low training loss. But this can't be tested if we can't measure volume.
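To see why direct measurement is hard, consider the naive approach: sample weights from some prior and count how often they fit the training set. The sketch below uses a toy linear "network" and a loss tolerance of my own choosing, not anything from the post; even in three dimensions the acceptance rate comes out tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression task standing in for a training set; the "network"
# is just a linear model with mean-squared-error loss.
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

# The volume hypothesis, read literally: sample weights at random and
# keep those with low training loss. The acceptance rate estimates the
# fraction of weight space (under this prior) filled by low-loss nets.
n, eps = 200_000, 0.05
samples = rng.normal(scale=3.0, size=(n, 3))       # crude "prior" over weights
losses = np.mean((samples @ X.T - y) ** 2, axis=1)
print(f"acceptance rate: {(losses < eps).mean():.2e}")
# Tiny already in 3 dimensions; for a real network with millions of
# weights the rate is effectively zero, hence the need for a smarter
# volume estimator.
```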
And networks which memorize their training data without generalizing have lower local volume, and hence higher complexity, than generalizing ones.
Importance sampling using gradient information helps address this issue by making us more likely to sample outliers.
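The post doesn't spell out the proposal distribution here, so the snippet below just illustrates the general mechanism on a textbook problem: estimating a rare tail probability. A proposal centered on the outliers samples them often, and the likelihood-ratio weights p(x)/q(x) keep the estimate unbiased; the same trick applies to directions whose cutoff distances are unusually long.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Target: the probability that a standard normal exceeds 4 -- a rare
# "outlier" event, analogous to the rare long-cutoff directions.
exact = norm.sf(4.0)

# Naive Monte Carlo almost never lands in the region of interest.
naive = (rng.standard_normal(n) > 4.0).mean()

# Importance sampling: draw from a proposal q centered on the outliers,
# then correct with the likelihood ratio p(x)/q(x).
x = rng.normal(loc=4.0, scale=1.0, size=n)         # proposal q = N(4, 1)
w = norm.pdf(x) / norm.pdf(x, loc=4.0, scale=1.0)  # weights p(x)/q(x)
is_est = np.mean((x > 4.0) * w)

print(f"exact={exact:.2e}  naive={naive:.2e}  importance={is_est:.2e}")
```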
The distance from the anchor to the edge of the region, along the random direction, gives us an estimate of how big (or how probable) the region is as a whole.
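Concretely, this is a ray-based volume estimate: for a star-shaped region, the volume is V_ball(d) · E_u[r(u)^d], where r(u) is the cutoff distance along a uniform random unit direction u. The sketch below checks this on a toy quadratic "loss basin" where the exact answer is known; the function names and bisection tolerance are mine, not the post's.

```python
import numpy as np
from scipy.special import gammaln

def cutoff_radius(loss, anchor, direction, eps, r_max=1e3, iters=60):
    """Bisection for the distance from `anchor` to where the loss first
    exceeds loss(anchor) + eps along `direction`. Assumes the loss keeps
    rising past the cutoff (true for this quadratic, approximate in general)."""
    base = loss(anchor)
    lo, hi = 0.0, r_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if loss(anchor + mid * direction) > base + eps:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

d = 10
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.5, 2.0, size=d))   # toy quadratic loss basin
loss = lambda w: w @ A @ w
anchor = np.zeros(d)                          # the "trained" weights
eps = 0.1                                     # loss tolerance defining the region

# Star-domain volume: V = V_ball(d) * E_u[r(u)^d] over uniform unit
# directions u. Work in log space to avoid overflow in high dimension.
n = 2000
log_rd = np.empty(n)
for i in range(n):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    log_rd[i] = d * np.log(cutoff_radius(loss, anchor, u, eps))
log_v_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
log_volume = log_v_ball + np.logaddexp.reduce(log_rd) - np.log(n)

# Exact answer for the ellipsoid {w : w^T A w <= eps}, as a sanity check.
log_exact = log_v_ball + (d / 2) * np.log(eps) - 0.5 * np.log(np.linalg.det(A))
print(f"estimated log-volume: {log_volume:.3f}  exact: {log_exact:.3f}")
```

Note that averaging r^d rather than r means a few unusually flat directions dominate the estimate, which is exactly the variance problem the gradient-informed importance sampling above is meant to tame.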
You can think of this as a measure of complexity: less probable means more complex.
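One way to make that precise (my notation, not necessarily the post's) is to define complexity as the description length implied by the region's probability:

$$C(\theta) \;=\; -\log_2 P\big(B_\epsilon(\theta)\big) \;\approx\; -\log_2 \frac{V_\epsilon(\theta)}{V_{\text{prior}}} \ \text{bits},$$

where $B_\epsilon(\theta)$ is the set of weights within loss tolerance $\epsilon$ of the anchor. On this reading, halving the local volume costs exactly one extra bit of complexity.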