Andrei Mircea
@mirandrom.bsky.social
PhD student at University of Montreal // Mila ··· mechanistic understanding of LLMs + Human-AI collaboration for science ··· http://mirandrom.github.io
not really sure what that implies with respect to their results, but it's a surprising contrast with no obvious explanation
August 1, 2025 at 9:41 AM
we also found interesting differences between optimizers with respect to loss deceleration (www.arxiv.org/abs/2506.05447). Surprisingly, Muon had worse post-deceleration convergence. That suggests it exacerbates rather than reduces interference in language modeling despite being a second-order optimizer.
August 1, 2025 at 9:39 AM
Thanks to my collaborators and mentors @katelobacheva.bsky.social, Irina Rish, Supriyo Chakraborty, and Nima Chitsazan.
Also Ashwinee Panda for coining "zero-sum learning", which is honestly a pretty great name.
July 12, 2025 at 6:48 PM
All of our code and artefacts are also open, which hopefully will help.
Code: github.com/mirandrom/zsl
Checkpoints: huggingface.co/mirandrom/zs...
Wandb logs: wandb.ai/amr-amr/zsl/...
July 12, 2025 at 6:46 PM
TL;DR We find two new phenomena (loss deceleration + zero-sum learning) and quantitatively show how scaling improves LLMs by mitigating them.
What’s cool is that these could potentially be mitigated independent of scaling (Step 2).
Exactly how to do this remains an open question.
July 12, 2025 at 6:46 PM
Special thanks to @katelobacheva.bsky.social and Irina Rish from @mila-quebec.bsky.social for their supervision; and to Nima Chitsazan and Supriyo Chakraborty from CapitalOne for their support on this project during my summer internship there!
December 15, 2024 at 6:09 PM
🧵 (12/N) If you’re still reading, here are some neat plots to express my gratitude. These are per-token loss landscape cross-sections, taken along weight update directions at different train steps. Also equivalent cross-sections of overall losses extruded in 3D because why not.
December 15, 2024 at 5:40 PM
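For readers curious how plots like these can be produced, here is a minimal sketch, assuming a PyTorch / HuggingFace-style causal LM and a held-out batch; the function names and the interpolation range are illustrative, and the paper's exact setup may differ.

```python
# Sketch: per-token loss cross-sections along a weight-update direction.
# Assumes a HuggingFace-style causal LM whose forward returns .logits, and
# labels already shifted to align with the logits.
import torch
import torch.nn.functional as F

def flat_params(model):
    # capture the full parameter vector, e.g. right before / after an optimizer step
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()]).clone()

def load_flat_params(model, flat):
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[i:i + n].view_as(p))
            i += n

@torch.no_grad()
def per_token_loss(model, input_ids, labels):
    logits = model(input_ids).logits
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    )  # one loss value per token

@torch.no_grad()
def cross_section(model, theta_before, theta_after, input_ids, labels, n_points=21):
    """Per-token losses along theta_before + alpha * (theta_after - theta_before)."""
    direction = theta_after - theta_before
    alphas = torch.linspace(-1.0, 2.0, n_points)  # illustrative range
    losses = []
    for a in alphas:
        load_flat_params(model, theta_before + a * direction)
        losses.append(per_token_loss(model, input_ids, labels))
    load_flat_params(model, theta_after)  # restore the current weights
    return alphas, torch.stack(losses)    # shape [n_points, n_tokens]
```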
🧵 (11/N) While our hypothesis and results confirm that there exist mechanisms underlying scaling improvements that can be targeted directly and independent of scale, they do not fully account for the effect of scaling on loss deceleration. This is something we’re working on!
December 15, 2024 at 5:39 PM
🧵 (10/N) We also observe that scaling decreases gradient opposition before deceleration, effectively contributing to greater loss improvements before deceleration. While SGO converges to ~1 across scales, its relative effect on ZSL appears to be mitigated by scale.
December 15, 2024 at 5:39 PM
🧵 (9/N) Explaining ZSL with systematic gradient opposition (SGO)
In our paper, we show how SGO (destructive interference in per-example gradients approaching 1) fundamentally results in ZSL, and confirm that it co-occurs with and explains deceleration.
December 15, 2024 at 5:38 PM
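For concreteness, here is a rough sketch of how gradient opposition can be measured, assuming SGO is quantified by applying a destructive-interference measure of the form D = 1 − |Σ_i g_i| / Σ_i |g_i| to per-example gradients g_i, with an L1 aggregation over coordinates; the paper's exact definition and aggregation may differ.

```python
# Sketch: systematic gradient opposition (SGO) as destructive interference
# across per-example gradients. The L1-based aggregation here is an assumption.
import torch

def destructive_interference(grads: torch.Tensor) -> torch.Tensor:
    """grads: [num_examples, num_params]. Near 1 => the gradients mostly cancel."""
    cancelled = grads.sum(dim=0).abs().sum()   # ||sum_i g_i||_1
    total = grads.abs().sum()                  # sum_i ||g_i||_1
    return 1.0 - cancelled / total

def per_example_grads(model, loss_fn, examples):
    """Naive per-example gradients: one backward pass per example."""
    out = []
    for ex in examples:
        model.zero_grad()
        loss_fn(model, ex).backward()
        g = torch.cat([
            (p.grad if p.grad is not None else torch.zeros_like(p)).reshape(-1)
            for p in model.parameters()
        ])
        out.append(g.detach().clone())
    model.zero_grad()
    return torch.stack(out)  # [num_examples, num_params]

# Hypothetical usage:
# sgo = destructive_interference(per_example_grads(model, loss_fn, batch))
```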
🧵 (8/N) To go beyond co-occurrence, we disentangle the relative contribution of ZSL to slowing loss improvements and show that it is indeed the principal contributor to loss deceleration across scales.
December 15, 2024 at 5:37 PM
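One illustrative way such a decomposition can be set up (not necessarily the paper's exact formulation): writing Δℓ_i for the change in loss on example i over a training interval, and D for the destructive interference defined in the (7/N) post below,

|mean_i Δℓ_i| = (1 − D) · mean_i |Δℓ_i|,  with  D = 1 − |Σ_i Δℓ_i| / Σ_i |Δℓ_i|,

so slow overall improvement can come either from small per-example changes or from strong cancellation between them (D → 1), and the two factors can be tracked separately.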
🧵 (7/N) To quantify ZSL, we define destructive interference as the rate at which elements in a sum cancel out, and measure it for per-example loss improvements throughout training. Consistent with our hypothesis, ZSL occurs with deceleration and is decreased by scale.
December 15, 2024 at 5:36 PM
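A minimal sketch of this measurement, assuming destructive interference of a set of values takes the form D(x) = 1 − |Σ_i x_i| / Σ_i |x_i| (the exact definition in the paper may differ):

```python
# Sketch: destructive interference of per-example loss improvements between
# two checkpoints. Values near 1 mean improvements and degradations cancel out.
import numpy as np

def destructive_interference(x: np.ndarray) -> float:
    """x: per-example loss improvements, e.g. loss_before - loss_after."""
    return float(1.0 - np.abs(x.sum()) / np.abs(x).sum())

# Hypothetical usage with per-example losses logged at two training steps:
# delta = losses_step_t - losses_step_t_plus_k   # positive = example improved
# print(destructive_interference(delta))         # near 1 => zero-sum learning
```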
🧵 (6/N) In ZSL, systematic gradient opposition between tokens leads to degenerate training dynamics where improvements in one set of tokens are offset by degradation in another, bottlenecking the overall rate of improvement and leading to deceleration.
December 15, 2024 at 5:35 PM
🧵 (5/N) Explaining loss deceleration with zero-sum learning
In other words, by explaining loss deceleration (and the mitigating effect of scale), we can explain scaling improvements. We propose the zero-sum learning (ZSL) hypothesis as an explanation for deceleration.
December 15, 2024 at 5:34 PM
🧵 (4/N) Specifically, scaling seems to improve final loss by improving 1) the loss at which deceleration occurs and 2) the log-log rate of loss improvement after deceleration. Using BNSL (broken neural scaling law) fits, we can measure these quantities and tie them to final loss (i.e. to scaling improvements).
December 15, 2024 at 5:33 PM
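As a rough illustration of how these two quantities can be extracted from a training curve, here is a two-segment linear fit in log-log space; it is a simplified stand-in for the BNSL parameterization used in the paper, and the function below is hypothetical.

```python
# Sketch: estimate (1) the loss at which deceleration occurs and (2) the
# post-deceleration log-log rate, via a two-segment linear fit in log-log space.
import numpy as np

def fit_deceleration(steps: np.ndarray, losses: np.ndarray):
    """Return (loss_at_break, post_break_slope) for a loss-vs-step curve."""
    x, y = np.log(steps), np.log(losses)
    best = None
    for b in range(2, len(x) - 2):  # candidate break indices
        s1, i1 = np.polyfit(x[:b], y[:b], 1)   # pre-deceleration segment
        s2, i2 = np.polyfit(x[b:], y[b:], 1)   # post-deceleration segment
        resid = (np.sum((y[:b] - (s1 * x[:b] + i1)) ** 2)
                 + np.sum((y[b:] - (s2 * x[b:] + i2)) ** 2))
        if best is None or resid < best[0]:
            best = (resid, b, s2)
    _, b, post_slope = best
    return float(losses[b]), float(post_slope)
```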
🧵 (3/N) Explaining scaling improvements with loss deceleration
Scaling improvements can be expressed in terms of mitigating “loss deceleration”, an abrupt slowdown in the rate of loss improvement, characterized by piecewise-linear log-log loss curves.
December 15, 2024 at 5:32 PM
🧵 (2/12) Motivation
LLM scaling laws predict but do not explain *how* scaling model size improves loss.
By identifying a mechanism underlying scaling improvements, we could target it directly and potentially improve LLMs independent of scale.
December 15, 2024 at 5:31 PM