Paper Title: "Learning without training: The implicit dynamics of in-context learning"
Paper Title: "Learning without training: The implicit dynamics of in-context learning"
Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.
---
They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.
Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.
---
🔎 Limits They Admit ...
They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.
When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.
That overlap is the empirical sign that the patch really does stand in for the prompt.
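For concreteness, here is a minimal sketch of how one such prompt can be sampled. Only the 50-pairs-plus-one-query shape comes from the post; the dimension, the Gaussian sampling, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 16, 50          # 50 labelled pairs per prompt; d is an arbitrary toy dimension

def sample_task():
    """One in-context linear-regression prompt: 50 (x, w·x) pairs plus a query."""
    w = rng.normal(size=d)                     # hidden task vector, fresh for every prompt
    x_ctx = rng.normal(size=(n_pairs, d))      # context inputs
    y_ctx = x_ctx @ w                          # context labels the model must pick up on
    x_query = rng.normal(size=d)               # the query the model is scored on
    return x_ctx, y_ctx, x_query, x_query @ w  # last value is the regression target

x_ctx, y_ctx, x_query, target = sample_task()
print(x_ctx.shape, y_ctx.shape, x_query.shape)   # (50, 16) (50,) (16,)
```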
---
Feeding tokens one by one stacks these tiny patches.
Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.
The shift shrinks as soon as a token stops adding new information to the context.
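A rough illustration of that stacking (not the paper's construction: the attention here is a toy softmax over random vectors, and all sizes are arbitrary). Each prefix of the context gets the Theorem 2.2 patch, so the contribution of token t is the difference of two consecutive patches, which is itself rank 1 and tends to shrink as the context grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))                 # frozen layer that sits after attention

def attn(context, query):
    """Toy softmax self-attention output for the query token."""
    keys = np.vstack([context, query[None]])    # query attends to context + itself
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

query = rng.normal(size=d)
context = rng.normal(size=(10, d))
a_plain = attn(np.empty((0, d)), query)     # query representation with no context

prev = np.zeros_like(W)
for t in range(1, len(context) + 1):
    a_ctx = attn(context[:t], query)
    # rank-1 patch for the first t context tokens (same shape as the theorem's formula)
    patch = np.outer(W @ (a_ctx - a_plain), a_plain) / (a_plain @ a_plain)
    step = patch - prev                     # what token t alone contributed
    print(f"token {t:2d}: step rank = {np.linalg.matrix_rank(step)}, "
          f"step norm = {np.linalg.norm(step):.4f}")
    prev = patch
```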
---
Theorem 2.2 gives the formula: multiply the base weights by the context-induced change in the query's attention output, then project onto the query representation, and out comes the patch.
Because the patch is rank 1, it stores almost no extra parameters yet still carries the full effect of the context on the query's prediction.
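In symbols (a paraphrase, with A(C, x) the attention output of the query token when the context C is present and A(x) when it is not):

$$
W\,A(C,x) \;=\; \bigl(W + \Delta W(C)\bigr)\,A(x),
\qquad
\Delta W(C) \;=\; \frac{\bigl(W\,[A(C,x)-A(x)]\bigr)\,A(x)^{\top}}{\lVert A(x)\rVert^{2}}.
$$

The outer product with A(x)ᵀ is what pins the patch to rank 1, and substituting ΔW(C) back in reproduces W·A(C,x) exactly.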
---
A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.
It multiplies that difference by the frozen weight matrix, then adds the result to the query's usual output, which is exactly the contribution the rank-1 patch supplies.
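A quick numerical sanity check of that step, with a random matrix standing in for the frozen layer and random vectors for the two attention outputs (a sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))        # frozen weight matrix after attention

a_plain = rng.normal(size=d)       # attention output of the query WITHOUT context
a_ctx   = rng.normal(size=d)       # attention output of the query WITH context
delta_a = a_ctx - a_plain          # context-induced change in the query's attention output

# rank-1 patch: (W @ delta_a) projected onto the query representation
patch = np.outer(W @ delta_a, a_plain) / (a_plain @ a_plain)

# applying the patched weights to the context-free activation reproduces
# the context-ful output of the original weights
assert np.allclose(W @ a_ctx, (W + patch) @ a_plain)
print("rank of patch:", np.linalg.matrix_rank(patch))   # -> 1
```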
---
Arxiv - https://arxiv.org/abs/2507.15857
This project was co-led with Mengning Wu (@WuMengning54261). ...
We hypothesize that the key advantage stems from the use of random masking in diffusion models, which serves as a form of data augmentation. Unlike AR models, which are trained on a single, fixed left-to-right ordering of each sequence, diffusion models see a fresh random masking of the data on every pass, so repeated epochs keep presenting new prediction problems.
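A toy sketch of that intuition (illustrative only; the masking rates and schedule here are made up, not the paper's): the same sentence produces a different masked prediction task on every pass, while a left-to-right AR model always sees the one fixed ordering.

```python
import random

sequence = "the quick brown fox jumps over the lazy dog".split()

def masked_view(tokens, mask_rate, rng):
    """One diffusion-style training view: hide a random subset of tokens."""
    return [tok if rng.random() > mask_rate else "[MASK]" for tok in tokens]

rng = random.Random(0)
for epoch in range(3):
    # Each pass over the (repeated) data draws a fresh masking pattern,
    # so repeated epochs still present the model with new prediction problems.
    print(masked_view(sequence, mask_rate=rng.uniform(0.1, 0.9), rng=rng))
```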
---
🚨#9: ...
Lastly we evaluate the best-performing diffusion and AR models (trained under the same data budget) on a range of language understanding tasks.
Across most benchmarks, the diffusion model comes out ahead.
---
Above we defined the critical compute threshold as the amount of FLOPs where diffusion matches AR performance for a given unique dataset size.
We find that we can derive a simple closed-form expression that predicts this threshold directly from the amount of unique data.
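As a sketch of what such a rule could look like, assume the threshold follows a power law in the unique-token count, C_crit(U) = a·U^b, and fit it in log-log space. The data points below are placeholders for illustration, not the paper's measurements.

```python
import numpy as np

# hypothetical (unique_tokens, critical_FLOPs) observations, for illustration only
U = np.array([25e6, 50e6, 100e6, 200e6])
C = np.array([3e17, 1.2e18, 5e18, 2e19])

b, log_a = np.polyfit(np.log(U), np.log(C), 1)   # slope = exponent, intercept = log prefactor
a = np.exp(log_a)
print(f"C_crit(U) ≈ {a:.3g} * U^{b:.2f}")

# predict the compute at which diffusion would catch up for a new data budget
print(f"predicted C_crit(400M tokens) ≈ {a * (400e6)**b:.3g} FLOPs")
```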
---
🚨 Finding #5: Muennighoff et al. showed that repeating the dataset for up to 4 epochs is nearly as effective as using fresh data for autoregressive models.
In contrast, we find that diffusion models can be trained on repeated data for up to 100 epochs, with repeated data remaining almost as effective as fresh data.
---
We adopt the data-constrained scaling framework introduced by @Muennighoff et al. in their ...
We show training curves of models trained with the same total compute, but different trade-offs between unique data and number of epochs.
An “epoch” here means one full pass over the unique portion of the training data.
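A tiny sketch of the trade-off the curves explore (token counts are illustrative, not the paper's grid): fix the total training-token budget, and every halving of the unique data doubles the number of epochs.

```python
total_tokens = 100e9   # fixed training-token budget (hence fixed compute for a given model size)

for unique_tokens in [100e9, 50e9, 25e9, 12.5e9]:
    epochs = total_tokens / unique_tokens    # an "epoch" = one full pass over the unique set
    print(f"unique = {unique_tokens/1e9:5.1f}B tokens -> {epochs:4.0f} epochs")
```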
---
In the above figure, we showed that increasing compute eventually favors diffusion. But compute can be scaled in two ways:
(i) making the model bigger, or (ii) training for more epochs on the same data.
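A back-of-the-envelope sketch of the two knobs, using the common C ≈ 6·N·D FLOPs approximation (N parameters, D training tokens); the specific sizes are illustrative.

```python
def flops(params, tokens):
    # standard transformer training-compute rule of thumb: C ≈ 6 * N * D
    return 6 * params * tokens

unique_tokens = 50e9
base         = flops(params=1e9, tokens=unique_tokens)      # 1 epoch, 1B-parameter model
bigger_model = flops(params=2e9, tokens=unique_tokens)      # (i)  double the parameters
more_epochs  = flops(params=1e9, tokens=2 * unique_tokens)  # (ii) a second epoch on the same data

print(f"{base:.2e}  {bigger_model:.2e}  {more_epochs:.2e}")  # both options double the compute
```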
---
Across different unique data scales, we observe:
1️⃣ At low compute, Autoregressive models win.
2️⃣ After a certain amount of compute, diffusion models catch up and then pull ahead.
---