Andrei Mircea
@mirandrom.bsky.social
PhD student at University of Montreal // Mila ··· mechanistic understanding of LLMs + Human-AI collaboration for science ··· http://mirandrom.github.io
we also found interesting differences between optimizers with respect to loss deceleration (www.arxiv.org/abs/2506.05447). Surprisingly, Muon had worse post-deceleration convergence, suggesting it exacerbates rather than reduces interference in language modeling despite being a second-order optimizer.
August 1, 2025 at 9:39 AM
Step 1: Understand how scaling improves LLMs.
Step 2: Directly target underlying mechanism.
Step 3: Improve LLMs independent of scale. Profit.

In our ACL 2025 paper we look at Step 1 in terms of training dynamics.

Project: mirandrom.github.io/zsl
Paper: arxiv.org/pdf/2506.05447
July 12, 2025 at 6:44 PM
🧵 (12/N) If you’re still reading, here are some neat plots to express my gratitude. These are per-token loss landscape cross-sections, taken along weight update directions at different train steps. Also equivalent cross-sections of overall losses extruded in 3D because why not.
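For anyone curious, here's a minimal sketch of how cross-sections like these can be computed (not the exact script behind these plots): evaluate the loss at θ + α·Δθ for a range of step sizes α along a single weight-update direction Δθ, e.g. the difference between two checkpoints. The model/batch interface below is a placeholder, assuming a HuggingFace-style `model(**batch).loss`.

```python
import torch

@torch.no_grad()
def loss_cross_section(model, batch, update_dir, alphas):
    """Evaluate loss along a 1D cross-section of the loss landscape.

    update_dir: dict {param_name: direction_tensor}, e.g. the difference
    between parameters at two train steps. For each alpha, parameters are
    set to theta + alpha * direction, the loss is recorded, and the
    original parameters are restored at the end.
    """
    originals = {n: p.detach().clone() for n, p in model.named_parameters()}
    losses = []
    for alpha in alphas:
        for n, p in model.named_parameters():
            if n in update_dir:
                p.copy_(originals[n] + alpha * update_dir[n])
        losses.append(model(**batch).loss.item())  # assumes HF-style output
    for n, p in model.named_parameters():
        p.copy_(originals[n])  # restore original weights
    return losses
```

Keeping per-token losses instead of the batch mean gives the per-token cross-sections; stacking cross-sections from consecutive steps gives the extruded 3D versions.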
December 15, 2024 at 5:40 PM
🧵 (9/N) Explaining ZSL with systematic gradient opposition (SGO)

In our paper, we show how SGO (destructive interference in per-example gradients approaching 1) fundamentally results in ZSL, and confirm that SGO both co-occurs with and explains deceleration.
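As a rough sketch of how SGO could be measured (a simplified stand-in for the paper's estimator, with a generic model/loss interface assumed): stack per-example gradients and check how much they cancel when summed.

```python
import torch

def destructive_interference(vectors: torch.Tensor) -> float:
    """Interference of the rows of `vectors`: 0 if they all point the same
    way, approaching 1 when they systematically cancel out (SGO -> ZSL)."""
    return float(1.0 - vectors.sum(dim=0).norm() / vectors.norm(dim=1).sum())

def per_example_gradients(model, loss_fn, xs, ys):
    """Flattened gradient of each example's loss, stacked into (batch, n_params)."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        grads.append(g.detach().clone())
    return torch.stack(grads)
```

A value near 1 would indicate SGO: per-example gradients mostly cancel, so the aggregate update barely moves the overall loss.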
December 15, 2024 at 5:38 PM
🧵 (8/N) To go beyond co-occurrence, we disentangle the relative contribution of ZSL to slowing loss improvements and show that it is indeed the principal contributor to loss deceleration across scales.
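One plausible way to do this kind of disentangling (a sketch of the general idea, not necessarily the paper's exact decomposition, with hypothetical names): the mean loss improvement factors into a magnitude term and an interference term, so a slowdown can be attributed to shrinking per-example improvements, rising cancellation (ZSL), or both.

```python
import numpy as np

def decompose_improvement(delta_losses: np.ndarray):
    """Decompose the mean loss improvement over examples between two checkpoints.

    delta_losses: per-example loss improvements (positive = improved).
    Returns (net, magnitude, interference), where |net| = magnitude * (1 - interference).
    """
    net = delta_losses.mean()
    magnitude = np.abs(delta_losses).mean()
    interference = 1.0 - np.abs(delta_losses.sum()) / np.abs(delta_losses).sum()
    return net, magnitude, interference
```

Tracking how much of the drop in `net` comes from `magnitude` versus `(1 - interference)` over training is one way to attribute deceleration to ZSL.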
December 15, 2024 at 5:37 PM
🧵 (7/N) To quantify ZSL, we define destructive interference as the rate at which elements in a sum cancel out, and measure it for per-example loss improvements throughout training. Consistent with our hypothesis, ZSL occurs with deceleration and is decreased by scale.
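Concretely, with Δℓ_i the loss improvement on example i between two checkpoints, one way to write this measure (my paraphrase of the definition, not the paper's exact notation):

```latex
D(\Delta\ell_1, \dots, \Delta\ell_N)
  \;=\; 1 - \frac{\bigl|\sum_{i=1}^{N} \Delta\ell_i\bigr|}{\sum_{i=1}^{N} |\Delta\ell_i|}
```

D ≈ 0 means the improvements mostly share a sign; D → 1 means they largely cancel out, i.e. learning becomes zero-sum.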
December 15, 2024 at 5:36 PM
🧵 (4/N) Specifically, scaling seems to improve loss by improving 1) the loss at which deceleration occurs, and 2) the log-log rate of loss improvement after deceleration. Using BNSL (broken neural scaling laws), we can measure these quantities and tie them to final loss (i.e. to scaling improvements).
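A minimal sketch of how these two quantities could be read off a loss curve, assuming a simplified two-segment fit of log loss against log step (a stand-in for the full BNSL functional form; names are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

def two_segment(log_t, log_td, log_Ld, a1, a2):
    """Piecewise-linear log-log loss curve: slope -a1 before the break at
    (log_td, log_Ld) and slope -a2 after it (deceleration when a2 < a1)."""
    return np.where(
        log_t < log_td,
        log_Ld - a1 * (log_t - log_td),
        log_Ld - a2 * (log_t - log_td),
    )

def fit_deceleration(steps, losses):
    """Return (L_d, a2): the loss at which deceleration occurs and the
    log-log rate of loss improvement after deceleration."""
    log_t, log_L = np.log(steps), np.log(losses)
    p0 = [np.median(log_t), np.median(log_L), 1.0, 0.1]
    (log_td, log_Ld, a1, a2), _ = curve_fit(two_segment, log_t, log_L, p0=p0)
    return float(np.exp(log_Ld)), float(a2)
```

In this parameterization, scaling improvements would show up as a lower L_d and/or a larger a2.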
December 15, 2024 at 5:33 PM
🧵 (3/N) Explaining scaling improvements with loss deceleration
Scaling improvements can be expressed in terms of mitigating “loss deceleration”: an abrupt slowdown in the rate of loss improvement, characterized by piecewise-linear log-log loss curves.
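Roughly, with t the train step and t_d the deceleration point (a simplified rendering of the piecewise-linear log-log picture, not the full BNSL form used in the paper):

```latex
\log L(t) \;\approx\;
\begin{cases}
  \log L_d - \alpha_1 \,(\log t - \log t_d), & t < t_d \\
  \log L_d - \alpha_2 \,(\log t - \log t_d), & t \ge t_d
\end{cases}
\qquad \text{with } \alpha_2 \ll \alpha_1 .
```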
December 15, 2024 at 5:32 PM
🧵 (2/12) Motivation
LLM scaling laws predict but do not explain *how* scaling model size improves loss.
By identifying a mechanism underlying scaling improvements, we could target it directly and potentially improve LLMs independent of scale.
December 15, 2024 at 5:31 PM
📢 New paper “Language model scaling laws and zero-sum learning” at Sci4DL #neurips2024.

ℹ️ openreview.net/forum?id=yBq2g832Go TL;DR: scaling improves LMs by mitigating zero-sum learning, a mechanism that could be targeted directly and independent of scale.

West 205-207 4:30-5:30 PM

🧵 (1/12)
December 15, 2024 at 5:30 PM