Eshaan Nichani
@eshaannichani.bsky.social
phd student @ princeton · deep learning theory
eshaannichani.com
Indeed, training two-layer nets in practice matches the theoretical scaling law: (8/10)
May 5, 2025 at 4:14 PM
As a corollary, when the a_p follow a power law, the population loss exhibits power-law decay in the runtime/sample size and the student width.

Matches the functional form of empirical neural scaling laws (e.g., Chinchilla)! (7/10)
May 5, 2025 at 4:14 PM
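A heuristic sketch of why this corollary holds (the residual-loss approximation below is an illustrative simplification, not the paper's argument): once the top P directions are recovered, suppose the remaining population loss is governed by the tail energy of the unlearned coefficients. For a power law a_p ∝ p^{-α} with α > 1/2, that tail itself decays as a power of P:

```latex
% Heuristic only: assumes the residual loss after recovering the top P
% directions scales like the tail energy of the remaining coefficients.
a_p \propto p^{-\alpha}
\;\Longrightarrow\;
\mathcal{L}(P) \;\approx\; \sum_{p > P} a_p^{2}
\;\asymp\; \int_{P}^{\infty} u^{-2\alpha}\,\mathrm{d}u
\;\asymp\; P^{\,1-2\alpha}.
```

Plugging in the main theorem's guarantee (post 6/10) that the recoverable P grows polynomially with the sample size and the student width then turns this into power-law decay of the loss in runtime/samples and in m.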
We train a 2-homogeneous two-layer student neural net via online SGD on the squared loss.

Main Theorem: to recover the top P ≤ P* = d^c directions, student width m = Θ(P*) and sample size poly(d, 1/a_{P*}, P) suffice.

Polynomial complexity with a single-stage algorithm! (6/10)
May 5, 2025 at 4:14 PM
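A minimal runnable sketch of this training setup, assuming a ReLU student f(x) = Σ_i a_i relu(⟨w_i, x⟩) (which is 2-homogeneous in the joint parameters (a, W)) and an additive ReLU target with orthonormal directions; the activation, initialization scales, and step size here are illustrative choices, not the paper's exact algorithm:

```python
# Illustrative sketch only, not the paper's parameterization or hyperparameters:
# online SGD on the squared loss for a two-layer ReLU student
#     f(x) = sum_i a_i * relu(<w_i, x>)
# fitting an additive target f*(x) = sum_p a*_p * relu(<w*_p, x>)
# with orthonormal directions w*_p and power-law strengths a*_p.
import numpy as np

rng = np.random.default_rng(0)
d, P, m = 64, 16, 64               # input dim, number of target directions, student width
alpha, lr, steps = 1.0, 1e-2, 50_000

a_star = np.arange(1, P + 1, dtype=float) ** -alpha   # power-law task strengths a*_p
W_star = np.eye(d)[:P]                                 # orthonormal target directions w*_p

W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))    # student first layer
a = rng.normal(scale=1e-2, size=m)                     # student second layer


def relu(z):
    return np.maximum(z, 0.0)


for _ in range(steps):
    x = rng.normal(size=d)                   # fresh Gaussian sample each step (online SGD)
    y = a_star @ relu(W_star @ x)            # target output
    pre = W @ x                              # student pre-activations
    h = relu(pre)                            # student hidden activations
    err = a @ h - y                          # residual on this sample
    grad_a = err * h                         # d(0.5*err**2)/da
    grad_W = np.outer(err * a * (pre > 0), x)  # d(0.5*err**2)/dW, using relu'(z) = 1{z > 0}
    a -= lr * grad_a
    W -= lr * grad_W

# Rough population-loss estimate on fresh inputs.
X = rng.normal(size=(2000, d))
loss = np.mean((relu(X @ W.T) @ a - relu(X @ W_star.T) @ a_star) ** 2)
print(f"estimated population loss: {loss:.4f}")
```

Note that both layers receive a single vanilla SGD step per fresh sample, with no layer-wise or multi-phase schedule, matching the "single-stage algorithm" point above.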
The additive model target is thus a width-P two-layer neural network.

Prior works either assume P = O(1) (the multi-index model setting) or require complexity exponential in κ = a_1/a_P.

But to get a smooth scaling law, we need to handle many tasks (P → ∞) with varying strengths (κ → ∞). (5/10)
May 5, 2025 at 4:14 PM
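For concreteness, here is the usual way to write such an additive-model target in this line of work (the activation σ and the orthonormality of the directions are assumptions of this sketch, not quoted from the paper):

```latex
% Standard additive-model form; details may differ from the paper's setup.
f_*(x) \;=\; \sum_{p=1}^{P} a_p\, \sigma\!\big(\langle w_p, x\rangle\big),
\qquad a_1 \ge a_2 \ge \cdots \ge a_P > 0,
\qquad \{w_p\}_{p=1}^{P} \ \text{orthonormal}.
```

Viewed this way, f_* is exactly a two-layer network of width P, and κ = a_1/a_P measures how unevenly the task strengths are spread.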
One explanation is the additive model hypothesis:
- The cumulative loss can be decomposed into contributions from many distinct skills, each of which individually exhibits emergence.
- The juxtaposition of many learning curves at varying timescales leads to a smooth power law in the loss. (3/10)
May 5, 2025 at 4:14 PM
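A toy numerical illustration of this hypothesis (the sigmoidal per-skill curves, the power-law timescales, and all constants below are illustrative choices, not taken from the paper): each skill's loss drops sharply at its own characteristic time, yet the sum over skills traces out a smooth, approximately power-law curve.

```python
# Toy illustration: many sharp ("emergent") per-skill learning curves with
# power-law-spread timescales and strengths sum to a smooth, roughly
# power-law total loss. All choices below are illustrative only.
import numpy as np

P, alpha, beta = 1000, 1.0, 2.0
t = np.logspace(0, 8, 200)                                  # "compute" / training time
strengths = np.arange(1, P + 1, dtype=float) ** (-2 * alpha)  # per-skill loss contribution
timescales = np.arange(1, P + 1, dtype=float) ** beta         # skill p is learned around t ~ p**beta

# Per-skill loss: near its full strength before its timescale, near 0 after
# (a sharp sigmoid in log-time, i.e. an "emergent" transition).
per_skill = 1.0 / (1.0 + (t[:, None] / timescales[None, :]) ** 8)
total = per_skill @ strengths                               # cumulative loss over all skills

# Fit a line in log-log coordinates over an intermediate range of t.
slope = np.polyfit(np.log(t[20:150]), np.log(total[20:150]), 1)[0]
print(f"approximate power-law exponent of total loss: {slope:.2f}")
```

On log-log axes the total loss is close to a straight line, even though every individual summand is a step-like "emergent" curve.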
LLMs demonstrate “emergent capabilities”: the acquisition of a single task/skill exhibits a sharp transition as compute increases.

Yet “neural scaling laws” posit that increasing compute leads to a predictable power-law decay in the loss.

How do we reconcile these two phenomena? (2/10)
May 5, 2025 at 4:14 PM
Excited to announce a new paper with Yunwei Ren, Denny Wu, and @jasondeanlee.bsky.social!

We prove a neural scaling law for SGD learning of extensive-width two-layer neural networks.

arxiv.org/abs/2504.19983

🧵 below (1/10)
May 5, 2025 at 4:14 PM