Eshaan Nichani
@eshaannichani.bsky.social
phd student @ princeton · deep learning theory
eshaannichani.com
Altogether, our work provides theoretical justification for the additive model hypothesis in gradient-based feature learning of shallow neural networks.

Check out our paper to learn more! (10/10)
May 5, 2025 at 4:14 PM
Compared to prior theory on scaling laws, we study the high-dimensional feature-learning regime and don't assume a priori that the learning of different tasks can be decoupled.

Instead, the decoupling of different tasks (and thus emergence) arises from a "deflation" mechanism induced by SGD. (9/10)
May 5, 2025 at 4:14 PM
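
To make the deflation picture concrete (my informal paraphrase of the mechanism, not a statement from the paper): once some student neurons have fit the strongest directions, the residual that the remaining neurons are trained on is the same additive model with those directions removed, so the next-largest coefficient takes over as the dominant signal.

```latex
% Informal paraphrase of the deflation picture (not a statement from the paper).
% If the top k directions have already been fit, the residual target is
  f_*(x) \;-\; \sum_{p \le k} a_p\,\sigma(w_p \cdot x)
    \;=\; \sum_{p > k} a_p\,\sigma(w_p \cdot x),
% an additive model of the same form whose leading strength is now a_{k+1},
% so the tasks are learned one after another without assuming decoupling up front.
```
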
Indeed, training two-layer nets in practice matches the theoretical scaling law. (8/10)
May 5, 2025 at 4:14 PM
As a corollary, when the a_p follow a power law, the population loss exhibits power-law decay in the runtime/sample size and the student width.

This matches the functional form of empirical neural scaling laws (e.g., Chinchilla)! (7/10)
May 5, 2025 at 4:14 PM
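
A rough back-of-envelope for why power-law coefficients give a power-law loss (my own heuristic, not the paper's proof; it glosses over the precise dependence on runtime and width):

```latex
% Heuristic only. Assume a_p \propto p^{-\alpha} with \alpha > 1/2, and that after
% the top P tasks are learned the population loss is the energy of the rest:
\mathcal{L}(P) \;\approx\; \sum_{p > P} a_p^2
  \;\propto\; \sum_{p > P} p^{-2\alpha}
  \;\asymp\; P^{\,1 - 2\alpha}.
% If the width and samples needed to learn P tasks grow polynomially in P
% (as in the main theorem), the loss decays as a power law in those resources.
```
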
We train a 2-homogeneous two-layer student neural net via online SGD on the squared loss.

Main Theorem: to recover the top P ≤ P* = d^c directions, student width m = Θ(P*) and sample size poly(d, 1/a_{P*}, P) suffice.

Polynomial complexity with a single-stage algorithm! (6/10)
May 5, 2025 at 4:14 PM
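
A minimal numpy sketch of this setup (my own toy, not the paper's code). The ReLU activation, step size, and batch size are assumptions made for concreteness; training both layers of a ReLU network is one natural 2-homogeneous parameterization, though the paper's may differ.

```python
# Toy sketch: additive single-index target, 2-homogeneous two-layer student,
# online SGD on the squared loss with fresh Gaussian samples at every step.
# All hyperparameters below are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
d, P, m = 64, 8, 32                       # input dim, number of tasks, student width
alpha = 1.0
a_star = np.arange(1, P + 1) ** -alpha    # task strengths a_p ~ p^-alpha (assumed)
W_star = np.linalg.qr(rng.standard_normal((d, P)))[0].T   # orthonormal directions w_p

relu = lambda z: np.maximum(z, 0.0)

def target(X):
    """f*(x) = sum_p a_p * relu(w_p . x)"""
    return relu(X @ W_star.T) @ a_star

# Student f(x) = sum_j c_j * relu(u_j . x): 2-homogeneous in (c_j, u_j).
U = rng.standard_normal((m, d)) / np.sqrt(d)
c = rng.standard_normal(m) / np.sqrt(m)

lr, batch, steps = 1e-2, 64, 20_000
for t in range(steps):
    X = rng.standard_normal((batch, d))   # fresh samples each step -> online SGD
    pre = X @ U.T                         # (batch, m) pre-activations
    act = relu(pre)
    resid = act @ c - target(X)           # prediction error, shape (batch,)
    # Gradients of 0.5 * mean squared error w.r.t. c and U.
    grad_c = act.T @ resid / batch
    grad_U = ((resid[:, None] * (pre > 0)) * c).T @ X / batch
    c -= lr * grad_c
    U -= lr * grad_U
    if t % 5_000 == 0:
        print(f"step {t:6d}  loss {0.5 * np.mean(resid ** 2):.4f}")
```
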
The additive model target is thus a width-P two-layer neural network.

Prior works either assume P = O(1) (multi-index model) or require complexity exponential in κ=a_1/a_P.

But to get a smooth scaling law, we need to handle many tasks (P→∞) with varying strengths (κ→∞). (5/10)
May 5, 2025 at 4:14 PM
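
Spelled out in the thread's notation (the ordering convention on the a_p is my assumption, implicit in κ = a_1/a_P):

```latex
f_*(x) \;=\; \sum_{p=1}^{P} a_p\, \sigma(w_p \cdot x),
\qquad a_1 \ge a_2 \ge \cdots \ge a_P > 0,
```

i.e. a two-layer network of width P whose second-layer weights are the task strengths a_p and whose first-layer rows are the directions w_p.
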
We study an idealized setting where each “skill” is a Gaussian single-index model f*(x) = a σ(w·x).

Prior work (Ben Arous et al. ’21) shows that SGD exhibits emergence: a long “search phase” with a loss plateau is followed by a rapid “descent phase” during which the loss converges. (4/10)
May 5, 2025 at 4:14 PM
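
A toy simulation of this plateau-then-descent shape (my own illustration, not the paper's or Ben Arous et al.'s code). The He2-style activation is an assumption, chosen because its information exponent of 2 makes the search-phase plateau pronounced; dimension and step size are likewise illustrative.

```python
# Toy illustration of the search/descent phases: online SGD on a single
# Gaussian single-index model, tracking the alignment u_t . w*.
import numpy as np

rng = np.random.default_rng(1)
d = 1024
w_star = np.zeros(d); w_star[0] = 1.0
sigma = lambda z: z ** 2 - 1.0            # He2-style activation (information exponent 2)
dsigma = lambda z: 2.0 * z

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                    # random init: alignment ~ d^{-1/2}
lr, steps = 1e-2 / d, 150_000

for t in range(steps):
    x = rng.standard_normal(d)            # fresh sample -> online SGD
    resid = sigma(u @ x) - sigma(w_star @ x)
    u -= lr * resid * dsigma(u @ x) * x   # SGD step on 0.5 * resid^2
    u /= np.linalg.norm(u)                # project back to the unit sphere
    if t % 15_000 == 0:
        # Alignment stays small for a long stretch, then climbs quickly toward 1.
        print(f"step {t:6d}  alignment {abs(u @ w_star):.3f}")
```
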
One explanation is the additive model hypothesis:
- The cumulative loss can be decomposed into contributions from many distinct skills, each of which individually exhibits emergence.
- The juxtaposition of many such learning curves at varying timescales yields a smooth power law in the loss (toy check sketched below). (3/10)
May 5, 2025 at 4:14 PM
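
A toy numerical check of this intuition (my own, with made-up functional forms for the per-skill curves and their timescales; it is a caricature, not the paper's model):

```python
# Toy numerical check of the additive-model intuition (functional forms are
# made up for illustration): many per-skill losses that each drop sharply at
# their own timescale, with power-law strengths, sum to a smooth power law.
import numpy as np

P, alpha = 10_000, 1.0
p = np.arange(1, P + 1)
a2 = p ** (-2 * alpha)                  # skill "energies" a_p^2 (assumed power law)
tau = p ** 2.0                          # emergence time of skill p (assumed)

t = np.logspace(1, 7, 200)
# Each skill's loss stays ~a_p^2 until t ~ tau_p, then drops sharply.
per_skill = a2[None, :] / (1.0 + (t[:, None] / tau[None, :]) ** 8)
L = per_skill.sum(axis=1)

# Log-log slope over the middle decades: approximately constant, L(t) ~ t^{-c},
# even though every summand is close to a step function.
mask = (t > 1e3) & (t < 1e6)
slope = np.polyfit(np.log(t[mask]), np.log(L[mask]), 1)[0]
print("fitted log-log slope:", slope)   # continuum heuristic: -(2*alpha - 1)/2 = -0.5
```
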
LLMs demonstrate “emergent capabilities”: the acquisition of a single task/skill exhibits a sharp transition as compute increases.

Yet “neural scaling laws” posit that increasing compute leads to predictable power law decay in the loss.

How do we reconcile these two phenomena? (2/10)
May 5, 2025 at 4:14 PM