Cosmin Stamate
@stamate.bsky.social
AI & ML Scientist | Researcher • Engineer • Lecturer
Original: x.com/rohanpaul_ai/status/1948572304809611701
July 25, 2025 at 8:13 AM
... Paper – https://arxiv.org/abs/2507.16003

Paper Title: "Learning without training: The implicit dynamics of in-context learning"
July 25, 2025 at 8:13 AM
... Results cover only the first generated token and one transformer block without MLP skip, so full‑stack models need more work.

Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.

--- ...
July 25, 2025 at 8:13 AM
... 🤝 Finetune vs. Implicit Patch

They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.

Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.

---

🔎 Limits They Admit ...
July 25, 2025 at 8:13 AM
... 🔬 Testing on Simple Linear Tasks

They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.

When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.

That overlap

--- ...
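To make that setup concrete, here is a minimal sketch of the kind of linear in-context task described above. The dimension, the prompt layout, and the zero-padding of the query row are my assumptions for illustration, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 8            # input dimension (assumed; the paper's exact size may differ)
n_ctx = 50       # 50 in-context (x, w·x) pairs, as in the post above

# One task = one hidden weight vector w; the model must infer it from context.
w = rng.normal(size=d)

# Context pairs and a held-out query drawn from the same task.
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = X_ctx @ w
x_query = rng.normal(size=d)
y_query = x_query @ w

# A common prompt layout for this task: each row carries an input and its target,
# and the final row is the query input with the target slot left blank.
prompt = np.concatenate([np.column_stack([X_ctx, y_ctx[:, None]]),
                         np.append(x_query, 0.0)[None, :]], axis=0)
print(prompt.shape)   # (51, d + 1): 50 context rows plus 1 query row

Swapping the 50 context rows for the rank 1 patch, as in the experiment above, amounts to feeding only the last row and running it through the patched weights instead.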
July 25, 2025 at 8:13 AM
... 📐 Hidden Gradient Descent

Feeding tokens one by one stacks these tiny patches.

Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.

The shift shrinks as soon as a token stops

--- ...
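Written out schematically (my own rendering, not the paper's exact statement of Proposition 3.1), the claim is that feeding context tokens one at a time produces a sequence of implicit weight updates of the form

W_{t} \;=\; W_{t-1} \;-\; \eta_t \, \nabla_W L_t(W_{t-1}), \qquad \eta_t \;\propto\; \frac{1}{\lVert a(x) \rVert^{2}},

where L_t is an implicit loss contributed by token t and a(x) is the query's attention output, so the effective step size is tied to the query vector's length, as the post says.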
July 25, 2025 at 8:13 AM
... 🧩 How the Patch Works

Theorem 2.2 shows a formula: multiply the base weights by the context change vector, then project it with the query representation, boom, you get the patch.

Because the patch is rank 1, it stores almost no extra parameters yet still carries the full

--- ...
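In symbols, a plausible reading of that recipe (my notation, not a verbatim statement of Theorem 2.2): let W be the frozen block weights, a(x) the query's attention output without the context, and \Delta a the shift the context adds to it. Then

\Delta W \;=\; \frac{\big(W \, \Delta a\big)\, a(x)^{\top}}{\lVert a(x) \rVert^{2}}, \qquad (W + \Delta W)\, a(x) \;=\; W\big(a(x) + \Delta a\big).

The identity on the right is what makes the patch work: running the query without the context through the patched weights reproduces the with-context output, and since \Delta W is an outer product it is rank 1.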
July 25, 2025 at 8:13 AM
... 🛠️ Temporary rank 1 patch

A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.

It multiplies that difference by the frozen weight matrix, then

--- ...
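Here is a tiny numerical check of that mechanism. The two attention outputs are random placeholders standing in for what the self-attention layer would return with and without the context, so this is a sketch of the algebra, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)
d = 16                                      # toy hidden size (assumption)

W = rng.normal(size=(d, d))                 # frozen block weight matrix
a_no_ctx = rng.normal(size=d)               # attention output for the query alone
a_with_ctx = rng.normal(size=d)             # attention output with the context prepended
delta_a = a_with_ctx - a_no_ctx             # the "tiny difference" the post mentions

# Multiply the difference by the frozen weights, then form a rank 1 outer product
# with the (normalized) context-free query representation.
patch = np.outer(W @ delta_a, a_no_ctx) / np.dot(a_no_ctx, a_no_ctx)

# The patched weights applied to the context-free activation reproduce the
# with-context output of the block exactly.
print(np.allclose((W + patch) @ a_no_ctx, W @ a_with_ctx))   # True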
July 25, 2025 at 8:13 AM
Original: x.com/hardmaru/status/1947998113450631350
July 23, 2025 at 1:32 PM
Original: x.com/mihirp98/status/1947736993229885545
July 23, 2025 at 12:52 PM
... In collaboration with Amir Zadeh, Katerina Fragkiadaki (@KaterinaFragiad) and Deepak Pathak (@pathak2206) at @mldcmu
July 23, 2025 at 12:52 PM
... Project webpage & code - https://diffusion-scaling.github.io

Arxiv - https://arxiv.org/abs/2507.15857

This project was co-led with Mengning Wu (@WuMengning54261). ...
July 23, 2025 at 12:52 PM
... 🚨#8: A natural question here is—why does diffusion outperform AR when data is limited?

We hypothesize that the key advantage stems from the use of random masking in diffusion models, which serves as a form of data augmentation. Unlike AR models, which are trained on a single,

---

🚨#9: ...
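The masking-as-augmentation point in #8 is easy to see in a toy sketch. The masking recipe below is a generic masked-diffusion-style scheme of my own choosing, not the exact one used in the paper.

import numpy as np

rng = np.random.default_rng(0)
seq = np.arange(12)             # one tokenized training sequence (toy)

# AR view: the factorization is fixed, so one sequence yields exactly one set of
# next-token prediction targets; every epoch repeats the same task verbatim.
ar_inputs, ar_targets = seq[:-1], seq[1:]

# Masked-diffusion-style view: each pass samples a fresh mask ratio and pattern,
# so repeated epochs over the same sequence pose different denoising problems.
def random_mask(tokens, mask_id=-1):
    ratio = rng.uniform(0.1, 1.0)              # sampled corruption level
    mask = rng.random(tokens.shape) < ratio    # which positions to hide
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, mask                     # predict tokens at masked positions

for epoch in range(3):
    corrupted, mask = random_mask(seq)
    print(epoch, corrupted)                    # a different "view" each epoch

The AR view of a sequence never changes across epochs, while each diffusion epoch sees a freshly corrupted version of it, which is the data-augmentation effect being hypothesized.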
July 23, 2025 at 12:52 PM
... 🚨Finding #7: The data efficiency of diffusion models translates to better downstream performance.

Lastly we evaluate the best-performing diffusion and AR models (trained under the same data budget) on a range of language understanding tasks.

Across most benchmarks, diffusion

--- ...
July 23, 2025 at 12:52 PM
... 🚨 Finding #6: The compute required for diffusion to outperform AR follows a predictable power law.

Above we defined the critical compute threshold as the amount of FLOPs where diffusion matches AR performance for a given unique dataset size.

We find that we can derive a simple

--- ...
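I read "a simple power law" as something of the following shape; the constants come from their fits, so the symbols below are placeholders rather than reported values:

C_{\text{crit}}(U) \;\approx\; A \cdot U^{\alpha},

where U is the unique dataset size and C_crit the FLOPs at which diffusion matches AR. On a log-log plot this is a straight line, which is what makes the crossover point predictable from smaller runs.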
July 23, 2025 at 12:52 PM
... ---

🚨 Finding #5: Muennighoff et al. showed that repeating the dataset up to 4 epochs is nearly as effective as using fresh data for autoregressive models.

In contrast, we find that diffusion models can be trained on repeated data for up to 100 epochs, while having repeated data

--- ...
July 23, 2025 at 12:52 PM
... 🚨 Finding #4: Diffusion models exhibit a much higher half-life of data reuse (R_D*), i.e., the number of epochs after which the returns from repeating data begin to diminish significantly.

We adopt the data-constrained scaling framework introduced by @Muennighoff et al. in their ...
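For readers who don't know that framework: Muennighoff et al. model repeated data with a decaying-returns term, roughly of the form below (my recollection of their parameterization, so treat the details as approximate):

D' \;=\; U_D \;+\; U_D \, R_D^{*} \big(1 - e^{-R_D / R_D^{*}}\big),

where U_D is the unique data, R_D the number of repetitions, and D' the effective data seen by the model. A larger half-life R_D^{*} means extra epochs keep adding effective data for longer, which is exactly the quantity the thread reports as much higher for diffusion.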
July 23, 2025 at 12:52 PM
... 🚨Finding #3: Diffusion models are significantly more robust to data repetition than autoregressive (AR) models.

We show training curves of models trained with the same total compute, but different trade-offs between unique data and number of epochs.

An “epoch” here means

--- ...
July 23, 2025 at 12:52 PM
... 🚨 Finding #2: Autoregressive models begin to overfit much more quickly, while diffusion shows no signs of overfitting even after 10x the number of epochs.
In the above figure, we showed that increasing compute eventually favors diffusion. But compute can be scaled in two ways:

(i)

--- ...
July 23, 2025 at 12:52 PM
... 🚨 Finding #1: Diffusion models outperform autoregressive models when trained with sufficient compute (i.e., more epochs & parameters).

Across different unique data scales, we observe:

1️⃣ At low compute, Autoregressive models win.
2️⃣ After a certain amount of compute,

--- ...
July 23, 2025 at 12:52 PM