Eshaan Nichani
@eshaannichani.bsky.social
phd student @ princeton · deep learning theory
eshaannichani.com
Indeed, training two-layer nets in practice matches the theoretical scaling law: (8/10)
May 5, 2025 at 4:14 PM
As a corollary, when the a_p follow a power law, the population loss exhibits power-law decay in the runtime/sample size and the student width.

Matches the functional form of empirical neural scaling laws (e.g., Chinchilla)! (7/10)
May 5, 2025 at 4:14 PM
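A heuristic sketch of why this corollary holds (the residual-loss approximation below is an illustrative simplification, not the paper's argument): once the top P directions are recovered, suppose the remaining population loss is governed by the tail energy of the unlearned coefficients. For a power law a_p ∝ p^{-α} with α > 1/2, that tail itself decays as a power of P:

```latex
% Heuristic only: assumes the residual loss after recovering the top P
% directions scales like the tail energy of the remaining coefficients.
a_p \propto p^{-\alpha}
\;\Longrightarrow\;
\mathcal{L}(P) \;\approx\; \sum_{p > P} a_p^{2}
\;\asymp\; \int_{P}^{\infty} u^{-2\alpha}\,\mathrm{d}u
\;\asymp\; P^{\,1-2\alpha}.
```

Plugging in the main theorem's guarantee (post 6/10) that the recoverable P grows polynomially with the sample size and the student width then turns this into power-law decay of the loss in runtime/samples and in m.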
We train a 2-homogeneous two-layer student neural net via online SGD on the squared loss.

Main Theorem: to recover the top P ≤ P* = d^c directions, student width m = Θ(P*) and sample size poly(d, 1/a_{P*}, P) suffice.

Polynomial complexity with a single-stage algorithm! (6/10)
May 5, 2025 at 4:14 PM
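A minimal runnable sketch of this training setup, assuming a ReLU student f(x) = Σ_i a_i relu(⟨w_i, x⟩) (which is 2-homogeneous in the joint parameters (a, W)) and an additive ReLU target with orthonormal directions; the activation, initialization scales, and step size here are illustrative choices, not the paper's exact algorithm:

```python
# Illustrative sketch only, not the paper's parameterization or hyperparameters:
# online SGD on the squared loss for a two-layer ReLU student
#     f(x) = sum_i a_i * relu(<w_i, x>)
# fitting an additive target f*(x) = sum_p a*_p * relu(<w*_p, x>)
# with orthonormal directions w*_p and power-law strengths a*_p.
import numpy as np

rng = np.random.default_rng(0)
d, P, m = 64, 16, 64               # input dim, number of target directions, student width
alpha, lr, steps = 1.0, 1e-2, 50_000

a_star = np.arange(1, P + 1, dtype=float) ** -alpha   # power-law task strengths a*_p
W_star = np.eye(d)[:P]                                 # orthonormal target directions w*_p

W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))    # student first layer
a = rng.normal(scale=1e-2, size=m)                     # student second layer


def relu(z):
    return np.maximum(z, 0.0)


for _ in range(steps):
    x = rng.normal(size=d)                   # fresh Gaussian sample each step (online SGD)
    y = a_star @ relu(W_star @ x)            # target output
    pre = W @ x                              # student pre-activations
    h = relu(pre)                            # student hidden activations
    err = a @ h - y                          # residual on this sample
    grad_a = err * h                         # d(0.5*err**2)/da
    grad_W = np.outer(err * a * (pre > 0), x)  # d(0.5*err**2)/dW, using relu'(z) = 1{z > 0}
    a -= lr * grad_a
    W -= lr * grad_W

# Rough population-loss estimate on fresh inputs.
X = rng.normal(size=(2000, d))
loss = np.mean((relu(X @ W.T) @ a - relu(X @ W_star.T) @ a_star) ** 2)
print(f"estimated population loss: {loss:.4f}")
```

Note that both layers receive a single vanilla SGD step per fresh sample, with no layer-wise or multi-phase schedule, matching the "single-stage algorithm" point above.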
The additive model target is thus a width-P two-layer neural network.

Prior works either assume P = O(1) (the multi-index model setting) or require complexity exponential in κ = a_1/a_P.

But to get a smooth scaling law, we need to handle many tasks (P → ∞) with varying strengths (κ → ∞). (5/10)
May 5, 2025 at 4:14 PM
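For concreteness, here is the usual way to write such an additive-model target in this line of work (the activation σ and the orthonormality of the directions are assumptions of this sketch, not quoted from the paper):

```latex
% Standard additive-model form; details may differ from the paper's setup.
f_*(x) \;=\; \sum_{p=1}^{P} a_p\, \sigma\!\big(\langle w_p, x\rangle\big),
\qquad a_1 \ge a_2 \ge \cdots \ge a_P > 0,
\qquad \{w_p\}_{p=1}^{P} \ \text{orthonormal}.
```

Viewed this way, f_* is exactly a two-layer network of width P, and κ = a_1/a_P measures how unevenly the task strengths are spread.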
One explanation is the additive model hypothesis:
- The cumulative loss can be decomposed into contributions from many distinct skills, each of which individually exhibits emergence.
- The juxtaposition of many learning curves at varying timescales leads to a smooth power law in the loss. (3/10)
May 5, 2025 at 4:14 PM
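A toy numerical illustration of this hypothesis (the sigmoidal per-skill curves, the power-law timescales, and all constants below are illustrative choices, not taken from the paper): each skill's loss drops sharply at its own characteristic time, yet the sum over skills traces out a smooth, approximately power-law curve.

```python
# Toy illustration: many sharp ("emergent") per-skill learning curves with
# power-law-spread timescales and strengths sum to a smooth, roughly
# power-law total loss. All choices below are illustrative only.
import numpy as np

P, alpha, beta = 1000, 1.0, 2.0
t = np.logspace(0, 8, 200)                                  # "compute" / training time
strengths = np.arange(1, P + 1, dtype=float) ** (-2 * alpha)  # per-skill loss contribution
timescales = np.arange(1, P + 1, dtype=float) ** beta         # skill p is learned around t ~ p**beta

# Per-skill loss: near its full strength before its timescale, near 0 after
# (a sharp sigmoid in log-time, i.e. an "emergent" transition).
per_skill = 1.0 / (1.0 + (t[:, None] / timescales[None, :]) ** 8)
total = per_skill @ strengths                               # cumulative loss over all skills

# Fit a line in log-log coordinates over an intermediate range of t.
slope = np.polyfit(np.log(t[20:150]), np.log(total[20:150]), 1)[0]
print(f"approximate power-law exponent of total loss: {slope:.2f}")
```

On log-log axes the total loss is close to a straight line, even though every individual summand is a step-like "emergent" curve.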
LLMs demonstrate “emergent capabilities”: the acquisition of a single task/skill exhibits a sharp transition as compute increases.

Yet “neural scaling laws” posit that increasing compute leads to a predictable power-law decay in the loss.

How do we reconcile these two phenomena? (2/10)
May 5, 2025 at 4:14 PM
Excited to announce a new paper with Yunwei Ren, Denny Wu, and @jasondeanlee.bsky.social!

We prove a neural scaling law for SGD learning of extensive-width two-layer neural networks.

arxiv.org/abs/2504.19983

🧵 below (1/10)
May 5, 2025 at 4:14 PM