Andrei Mircea
@mirandrom.bsky.social
PhD student at University of Montreal // Mila ··· mechanistic understanding of LLMs + Human-AI collaboration for science ··· http://mirandrom.github.io
we also found interesting differences between optimizers with respect to loss deceleration (www.arxiv.org/abs/2506.05447). Surprisingly, Muon had worse post-deceleration convergence, suggesting it exacerbates rather than reduces interference in language modeling despite being a second-order optimizer.
August 1, 2025 at 9:39 AM
Step 1: Understand how scaling improves LLMs.
Step 2: Directly target underlying mechanism.
Step 3: Improve LLMs independent of scale. Profit.

In our ACL 2025 paper we look at Step 1 in terms of training dynamics.

Project: mirandrom.github.io/zsl
Paper: arxiv.org/pdf/2506.05447
July 12, 2025 at 6:44 PM
🧵 (12/N) If you’re still reading, here are some neat plots to express my gratitude. These are per-token loss landscape cross-sections, taken along weight update directions at different train steps. Also equivalent cross-sections of overall losses extruded in 3D because why not.
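For anyone curious, here's a minimal sketch of how cross-sections like these can be computed (not the exact script behind these plots): evaluate the loss at θ + α·Δθ for a range of step sizes α along a single weight-update direction Δθ, e.g. the difference between two checkpoints. The model/batch interface below is a placeholder, assuming a HuggingFace-style `model(**batch).loss`.

```python
import torch

@torch.no_grad()
def loss_cross_section(model, batch, update_dir, alphas):
    """Evaluate loss along a 1D cross-section of the loss landscape.

    update_dir: dict {param_name: direction_tensor}, e.g. the difference
    between parameters at two train steps. For each alpha, parameters are
    set to theta + alpha * direction, the loss is recorded, and the
    original parameters are restored at the end.
    """
    originals = {n: p.detach().clone() for n, p in model.named_parameters()}
    losses = []
    for alpha in alphas:
        for n, p in model.named_parameters():
            if n in update_dir:
                p.copy_(originals[n] + alpha * update_dir[n])
        losses.append(model(**batch).loss.item())  # assumes HF-style output
    for n, p in model.named_parameters():
        p.copy_(originals[n])  # restore original weights
    return losses
```

Keeping per-token losses instead of the batch mean gives the per-token cross-sections; stacking cross-sections from consecutive steps gives the extruded 3D versions.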
December 15, 2024 at 5:40 PM
🧵 (9/N) Explaining ZSL with systematic gradient opposition (SGO)

In our paper, we show how SGO (destructive interference in per-example gradients approaching 1) fundamentally results in ZSL, and confirm that SGO both co-occurs with and explains deceleration.
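As a rough sketch of how SGO could be measured (a simplified stand-in for the paper's estimator, with a generic model/loss interface assumed): stack per-example gradients and check how much they cancel when summed.

```python
import torch

def destructive_interference(vectors: torch.Tensor) -> float:
    """Interference of the rows of `vectors`: 0 if they all point the same
    way, approaching 1 when they systematically cancel out (SGO -> ZSL)."""
    return float(1.0 - vectors.sum(dim=0).norm() / vectors.norm(dim=1).sum())

def per_example_gradients(model, loss_fn, xs, ys):
    """Flattened gradient of each example's loss, stacked into (batch, n_params)."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        grads.append(g.detach().clone())
    return torch.stack(grads)
```

A value near 1 would indicate SGO: per-example gradients mostly cancel, so the aggregate update barely moves the overall loss.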
December 15, 2024 at 5:38 PM
🧵 (8/N) To go beyond co-occurrence, we disentangle the relative contribution of ZSL to slowing loss improvements and show that it is indeed the principal contributor to loss deceleration across scales.
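One plausible way to do this kind of disentangling (a sketch of the general idea, not necessarily the paper's exact decomposition, with hypothetical names): the mean loss improvement factors into a magnitude term and an interference term, so a slowdown can be attributed to shrinking per-example improvements, rising cancellation (ZSL), or both.

```python
import numpy as np

def decompose_improvement(delta_losses: np.ndarray):
    """Decompose the mean loss improvement over examples between two checkpoints.

    delta_losses: per-example loss improvements (positive = improved).
    Returns (net, magnitude, interference), where |net| = magnitude * (1 - interference).
    """
    net = delta_losses.mean()
    magnitude = np.abs(delta_losses).mean()
    interference = 1.0 - np.abs(delta_losses.sum()) / np.abs(delta_losses).sum()
    return net, magnitude, interference
```

Tracking how much of the drop in `net` comes from `magnitude` versus `(1 - interference)` over training is one way to attribute deceleration to ZSL.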
December 15, 2024 at 5:37 PM
🧵 (7/N) To quantify ZSL, we define destructive interference as the rate at which elements in a sum cancel out, and measure it for per-example loss improvements throughout training. Consistent with our hypothesis, ZSL occurs with deceleration and is decreased by scale.
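Concretely, with Δℓ_i the loss improvement on example i between two checkpoints, one way to write this measure (my paraphrase of the definition, not the paper's exact notation):

```latex
D(\Delta\ell_1, \dots, \Delta\ell_N)
  \;=\; 1 - \frac{\bigl|\sum_{i=1}^{N} \Delta\ell_i\bigr|}{\sum_{i=1}^{N} |\Delta\ell_i|}
```

D ≈ 0 means the improvements mostly share a sign; D → 1 means they largely cancel out, i.e. learning becomes zero-sum.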
December 15, 2024 at 5:36 PM
🧵 (4/N) Specifically, scaling seems to improve loss by improving 1) the loss at which deceleration occurs, and 2) the log-log rate of loss improvement after deceleration. Using BNSL (broken neural scaling laws), we can measure these quantities and tie them to final loss (i.e. to scaling improvements).
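A minimal sketch of how these two quantities could be read off a loss curve, assuming a simplified two-segment fit of log loss against log step (a stand-in for the full BNSL functional form; names are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

def two_segment(log_t, log_td, log_Ld, a1, a2):
    """Piecewise-linear log-log loss curve: slope -a1 before the break at
    (log_td, log_Ld) and slope -a2 after it (deceleration when a2 < a1)."""
    return np.where(
        log_t < log_td,
        log_Ld - a1 * (log_t - log_td),
        log_Ld - a2 * (log_t - log_td),
    )

def fit_deceleration(steps, losses):
    """Return (L_d, a2): the loss at which deceleration occurs and the
    log-log rate of loss improvement after deceleration."""
    log_t, log_L = np.log(steps), np.log(losses)
    p0 = [np.median(log_t), np.median(log_L), 1.0, 0.1]
    (log_td, log_Ld, a1, a2), _ = curve_fit(two_segment, log_t, log_L, p0=p0)
    return float(np.exp(log_Ld)), float(a2)
```

In this parameterization, scaling improvements would show up as a lower L_d and/or a larger a2.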
December 15, 2024 at 5:33 PM
🧵 (3/N) Explaining scaling improvements with loss deceleration
Scaling improvements can be expressed in terms of mitigating “loss deceleration”: an abrupt slowdown in the rate of loss improvement, characterized by piecewise-linear log-log loss curves.
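Roughly, with t the train step and t_d the deceleration point (a simplified rendering of the piecewise-linear log-log picture, not the full BNSL form used in the paper):

```latex
\log L(t) \;\approx\;
\begin{cases}
  \log L_d - \alpha_1 \,(\log t - \log t_d), & t < t_d \\
  \log L_d - \alpha_2 \,(\log t - \log t_d), & t \ge t_d
\end{cases}
\qquad \text{with } \alpha_2 \ll \alpha_1 .
```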
December 15, 2024 at 5:32 PM
🧵 (2/12) Motivation
LLM scaling laws predict but do not explain *how* scaling model size improves loss.
By identifying a mechanism underlying scaling improvements, we could target it directly and potentially improve LLMs independent of scale.
December 15, 2024 at 5:31 PM
📢 New paper “Language model scaling laws and zero-sum learning” at Sci4DL #neurips2024.

ℹ️ openreview.net/forum?id=yBq2g832Go TL;DR: scaling improves LMs by mitigating zero-sum learning, a mechanism that could be targeted directly and independent of scale.

West 205-207 4:30-5:30 PM

🧵 (1/12)
December 15, 2024 at 5:30 PM