Eshaan Nichani
@eshaannichani.bsky.social
phd student @ princeton · deep learning theory
eshaannichani.com
Altogether, our work provides theoretical justification for the additive model hypothesis in gradient-based feature learning of shallow neural networks.

Check out our paper to learn more! (10/10)
May 5, 2025 at 4:14 PM
Compared to prior theory on scaling laws, we study the high-dimensional feature-learning regime and don't assume a priori that the learning of different tasks can be decoupled.

Instead, the decoupling of different tasks (and thus emergence) arises from a "deflation" mechanism induced by SGD. (9/10)
May 5, 2025 at 4:14 PM
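
To make the deflation picture concrete (my informal paraphrase of the mechanism, not a statement from the paper): once some student neurons have fit the strongest directions, the residual that the remaining neurons are trained on is the same additive model with those directions removed, so the next-largest coefficient takes over as the dominant signal.

```latex
% Informal paraphrase of the deflation picture (not a statement from the paper).
% If the top k directions have already been fit, the residual target is
  f_*(x) \;-\; \sum_{p \le k} a_p\,\sigma(w_p \cdot x)
    \;=\; \sum_{p > k} a_p\,\sigma(w_p \cdot x),
% an additive model of the same form whose leading strength is now a_{k+1},
% so the tasks are learned one after another without assuming decoupling up front.
```
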
Indeed, training two-layer nets in practice matches the theoretical scaling law. (8/10)
May 5, 2025 at 4:14 PM
As a corollary, when the a_p follow a power law, the population loss exhibits power-law decay in the runtime/sample size and the student width.

This matches the functional form of empirical neural scaling laws (e.g., Chinchilla)! (7/10)
May 5, 2025 at 4:14 PM
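
A rough back-of-envelope for why power-law coefficients give a power-law loss (my own heuristic, not the paper's proof; it glosses over the precise dependence on runtime and width):

```latex
% Heuristic only. Assume a_p \propto p^{-\alpha} with \alpha > 1/2, and that after
% the top P tasks are learned the population loss is the energy of the rest:
\mathcal{L}(P) \;\approx\; \sum_{p > P} a_p^2
  \;\propto\; \sum_{p > P} p^{-2\alpha}
  \;\asymp\; P^{\,1 - 2\alpha}.
% If the width and samples needed to learn P tasks grow polynomially in P
% (as in the main theorem), the loss decays as a power law in those resources.
```
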
We train a 2-homogeneous two-layer student neural net via online SGD on the squared loss.

Main Theorem: to recover the top P ≤ P* = d^c directions, student width m = Θ(P*) and sample size poly(d, 1/a_{P*}, P) suffice.

Polynomial complexity with a single-stage algorithm! (6/10)
May 5, 2025 at 4:14 PM
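
A minimal numpy sketch of this setup (my own toy, not the paper's code). The ReLU activation, step size, and batch size are assumptions made for concreteness; training both layers of a ReLU network is one natural 2-homogeneous parameterization, though the paper's may differ.

```python
# Toy sketch: additive single-index target, 2-homogeneous two-layer student,
# online SGD on the squared loss with fresh Gaussian samples at every step.
# All hyperparameters below are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
d, P, m = 64, 8, 32                       # input dim, number of tasks, student width
alpha = 1.0
a_star = np.arange(1, P + 1) ** -alpha    # task strengths a_p ~ p^-alpha (assumed)
W_star = np.linalg.qr(rng.standard_normal((d, P)))[0].T   # orthonormal directions w_p

relu = lambda z: np.maximum(z, 0.0)

def target(X):
    """f*(x) = sum_p a_p * relu(w_p . x)"""
    return relu(X @ W_star.T) @ a_star

# Student f(x) = sum_j c_j * relu(u_j . x): 2-homogeneous in (c_j, u_j).
U = rng.standard_normal((m, d)) / np.sqrt(d)
c = rng.standard_normal(m) / np.sqrt(m)

lr, batch, steps = 1e-2, 64, 20_000
for t in range(steps):
    X = rng.standard_normal((batch, d))   # fresh samples each step -> online SGD
    pre = X @ U.T                         # (batch, m) pre-activations
    act = relu(pre)
    resid = act @ c - target(X)           # prediction error, shape (batch,)
    # Gradients of 0.5 * mean squared error w.r.t. c and U.
    grad_c = act.T @ resid / batch
    grad_U = ((resid[:, None] * (pre > 0)) * c).T @ X / batch
    c -= lr * grad_c
    U -= lr * grad_U
    if t % 5_000 == 0:
        print(f"step {t:6d}  loss {0.5 * np.mean(resid ** 2):.4f}")
```
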
The additive model target is thus a width-P two-layer neural network.

Prior works either assume P = O(1) (multi-index model) or require complexity exponential in κ=a_1/a_P.

But to get a smooth scaling law, we need to handle many tasks (P→∞) with varying strengths (κ→∞). (5/10)
May 5, 2025 at 4:14 PM
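
Spelled out in the thread's notation (the ordering convention on the a_p is my assumption, implicit in κ = a_1/a_P):

```latex
f_*(x) \;=\; \sum_{p=1}^{P} a_p\, \sigma(w_p \cdot x),
\qquad a_1 \ge a_2 \ge \cdots \ge a_P > 0,
```

i.e. a two-layer network of width P whose second-layer weights are the task strengths a_p and whose first-layer rows are the directions w_p.
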
We study an idealized setting where each “skill” is a Gaussian single-index model f*(x) = a σ(w·x).

Prior work (Ben Arous et al. ’21) shows that SGD exhibits emergence: a long “search phase” with a loss plateau is followed by a rapid “descent phase” during which the loss converges. (4/10)
May 5, 2025 at 4:14 PM
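
A toy simulation of this plateau-then-descent shape (my own illustration, not the paper's or Ben Arous et al.'s code). The He2-style activation is an assumption, chosen because its information exponent of 2 makes the search-phase plateau pronounced; dimension and step size are likewise illustrative.

```python
# Toy illustration of the search/descent phases: online SGD on a single
# Gaussian single-index model, tracking the alignment u_t . w*.
import numpy as np

rng = np.random.default_rng(1)
d = 1024
w_star = np.zeros(d); w_star[0] = 1.0
sigma = lambda z: z ** 2 - 1.0            # He2-style activation (information exponent 2)
dsigma = lambda z: 2.0 * z

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                    # random init: alignment ~ d^{-1/2}
lr, steps = 1e-2 / d, 150_000

for t in range(steps):
    x = rng.standard_normal(d)            # fresh sample -> online SGD
    resid = sigma(u @ x) - sigma(w_star @ x)
    u -= lr * resid * dsigma(u @ x) * x   # SGD step on 0.5 * resid^2
    u /= np.linalg.norm(u)                # project back to the unit sphere
    if t % 15_000 == 0:
        # Alignment stays small for a long stretch, then climbs quickly toward 1.
        print(f"step {t:6d}  alignment {abs(u @ w_star):.3f}")
```
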
One explanation is the additive model hypothesis:
- The cumulative loss can be decomposed into contributions from many distinct skills, each of which individually exhibits emergence.
- The juxtaposition of many such learning curves at varying timescales yields a smooth power law in the loss (toy check sketched below). (3/10)
May 5, 2025 at 4:14 PM
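
A toy numerical check of this intuition (my own, with made-up functional forms for the per-skill curves and their timescales; it is a caricature, not the paper's model):

```python
# Toy numerical check of the additive-model intuition (functional forms are
# made up for illustration): many per-skill losses that each drop sharply at
# their own timescale, with power-law strengths, sum to a smooth power law.
import numpy as np

P, alpha = 10_000, 1.0
p = np.arange(1, P + 1)
a2 = p ** (-2 * alpha)                  # skill "energies" a_p^2 (assumed power law)
tau = p ** 2.0                          # emergence time of skill p (assumed)

t = np.logspace(1, 7, 200)
# Each skill's loss stays ~a_p^2 until t ~ tau_p, then drops sharply.
per_skill = a2[None, :] / (1.0 + (t[:, None] / tau[None, :]) ** 8)
L = per_skill.sum(axis=1)

# Log-log slope over the middle decades: approximately constant, L(t) ~ t^{-c},
# even though every summand is close to a step function.
mask = (t > 1e3) & (t < 1e6)
slope = np.polyfit(np.log(t[mask]), np.log(L[mask]), 1)[0]
print("fitted log-log slope:", slope)   # continuum heuristic: -(2*alpha - 1)/2 = -0.5
```
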
LLMs demonstrate “emergent capabilities”: the acquisition of a single task/skill exhibits a sharp transition as compute increases.

Yet “neural scaling laws” posit that increasing compute leads to predictable power law decay in the loss.

How do we reconcile these two phenomena? (2/10)
May 5, 2025 at 4:14 PM