@huggingface page: https://huggingface.co/papers/2502.12996
congrats to my collaborators @SatyenKale, who led that work, and Yani Donchev
Data-parallel (DP) requires 471 Gbit/s
Streaming DiLoCo with inner communication overlap: 1.4 Gbit/s
Streaming DiLoCo with eager outer communication overlap: 400 Mbit/s, more than a 1000x reduction
400 Mbit/s is consumer-grade bandwidth FYI
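quick sanity check on those ratios, just arithmetic on the numbers above:

```python
dp = 471e9        # data-parallel, bits/s
inner = 1.4e9     # Streaming DiLoCo + inner overlap
eager = 400e6     # Streaming DiLoCo + eager outer overlap

print(dp / inner)  # ~336x less bandwidth than DP
print(dp / eager)  # ~1177x less bandwidth than DP
```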
the update is the average of the *local up-to-date* update from the worker's own replica and the *remote stale* update from the other replicas
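in numpy-ish pseudocode, that merge could look something like this (my own sketch, names invented; not the paper's code):

```python
import numpy as np

def eager_outer_update(local_delta: np.ndarray, stale_remote_delta: np.ndarray) -> np.ndarray:
    # don't wait for the fresh all-reduce: mix the replica's own up-to-date update
    # with the stale all-reduced update received from the other replicas
    return 0.5 * (local_delta + stale_remote_delta)
```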
we can recover a bit by lowering the outer learning rate by 4x, but this is still unsatisfying
we first try a naive "delayed" version
can we overlap communication with computation over hundreds of steps?
-- yes we can
in this work led by @SatyenKale, we improve DiLoCo and use 1177x less bandwidth than data-parallel
"asynchronous training where each copy of the model does local computation [...] it makes people uncomfortable [...] but it actually works"
yep, i can confirm, it does work for real
see arxiv.org/abs/2501.18512
"asynchronous training where each copy of the model does local computation [...] it makes people uncomfortable [...] but it actually works"
yep, i can confirm, it does work for real
see arxiv.org/abs/2501.18512
Can we break away from the lockstep synchronization? yes!
Workers can have a few delay steps, and it just works, w/o any special handling.
Over how many steps can you safely overlap communication?
At least 5 without any significant loss of perf! That's a massive increase in tolerated latency.
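concretely, "tolerating a few steps of delay" means the all-reduced outer gradient is applied several steps after it was computed. A toy sketch of that mechanic (invented names, not the actual code):

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
params = np.zeros(4)
outer_lr, delay = 0.7, 5                   # apply each outer gradient 5 steps late

in_flight = deque()                        # reductions still travelling over the WAN
for step in range(50):
    in_flight.append(rng.normal(size=4))   # stand-in for a freshly all-reduced delta
    if len(in_flight) > delay:             # it only "arrives" `delay` steps later
        params -= outer_lr * in_flight.popleft()
print(params)
```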
It has only 37B activated params, but you need to sync 671B params in total! Hard to do across continents with data-parallel...
However, with our method? ❤️🔥
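back-of-envelope on why a naive per-step sync hurts (my numbers: bf16 gradients, one full copy per sync, ignoring all-reduce and protocol overhead):

```python
params = 671e9
bits = params * 2 * 8                      # bf16 -> ~1.34 TB per full sync

for name, bw in [("471 Gbit/s (DP-grade)", 471e9),
                 ("1 Gbit/s WAN", 1e9),
                 ("400 Mbit/s consumer link", 400e6)]:
    print(f"{name}: ~{bits / bw:,.0f} s per full synchronization")
```

and plain data-parallel wants (roughly) that sync every single step.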
but we need to scale up our pretraining further; this is still a relevant axis!
DeepSeek-R1 also notes it:
Abolish the tyranny of requiring huge bandwidth! ✊
As @m_ryabinin noted in Swarm Parallelism: larger networks spend more time doing computation, O(n^3), than doing communication, O(n^2).
We have much more time to sync at larger scales!
It estimates how much time is spent on costly communication between non-colocated devices and how much is spent crunching FLOPs.
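a crude homemade version of that estimate (assumptions all mine: 6·N·D training FLOPs, bf16 outer gradients, no overlap, no all-reduce factors):

```python
def step_times(n_params, tokens_per_step, flops_per_s, mfu, wan_bits_per_s):
    compute_s = 6 * n_params * tokens_per_step / (flops_per_s * mfu)
    comm_s = n_params * 2 * 8 / wan_bits_per_s   # ship one bf16 copy of the params
    return compute_s, comm_s

# e.g. a 100B dense model, 4M-token batches, 1 EFLOP/s of compute at 40% MFU, 1 Gbit/s WAN
c, m = step_times(100e9, 4e6, 1e18, 0.4, 1e9)
print(f"~{c:.0f}s of compute vs ~{m:.0f}s of cross-datacenter comm per sync")
```

which is exactly why you want to sync rarely, stream fragments, and overlap.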
It's even better when overtraining with a larger token budget. Remember the bitter lesson? Just put more data and flops into your model; Streaming DiLoCo enables that.
Quantizing your updates to 4 bits is enough -- you can barely see any change in performance.
And that's the free dessert 🍦
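for intuition, even a dead-simple absmax scheme at 4 bits is cheap to try (toy sketch; not necessarily the paper's exact format):

```python
import numpy as np

def quant4(x):
    """Toy symmetric absmax quantizer: 4-bit signed levels in [-7, 7]."""
    scale = np.abs(x).max() / 7 + 1e-12
    return np.clip(np.round(x / scale), -7, 7), scale

delta = np.random.default_rng(0).normal(size=100_000).astype(np.float32)
q, s = quant4(delta)                 # 8x fewer bits to ship than fp32
print(np.abs(q * s - delta).mean())  # per-element rounding error of the toy scheme
```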
See L9-12: the longer your model takes to do fwd/bwd, the more time you have to do communication!
That's the free main course 🍱
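mechanically, the overlap is just "kick off the reduction, keep training, collect the result later". Toy illustration with a thread standing in for the WAN (not the real implementation):

```python
import threading, time

result = {}
def slow_all_reduce(values):           # pretend this is a cross-datacenter reduction
    time.sleep(2.0)                    # 2 s of simulated WAN latency
    result["avg"] = sum(values) / len(values)

t = threading.Thread(target=slow_all_reduce, args=([1.0, 3.0],))
t.start()                              # communication starts in the background...

for _ in range(5):                     # ...while inner fwd/bwd steps keep running
    time.sleep(0.5)                    #    (stand-in for one fwd/bwd pass)

t.join()                               # by the time we need it, the result is there
print(result["avg"])
```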
We split the model into fragments of three layers each; this massively reduces the peak bandwidth w/o hurting ML performance.
That's the free appetizer 🥗
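the schedule staggers which fragment syncs at which step, so no single step ever moves the whole model. Toy sketch of that idea (my own illustration: 12 layers, fragments of 3):

```python
n_layers, frag_size, H = 12, 3, 30      # H = steps between two syncs of a fragment
frags = [list(range(i, i + frag_size)) for i in range(0, n_layers, frag_size)]
offset = H // len(frags)                # stagger each fragment's sync step

for step in range(1, 2 * H + 1):
    for f in range(len(frags)):
        if step % H == (f * offset) % H:
            print(f"step {step:3d}: all-reduce fragment {f} only (layers {frags[f]})")
```

so the peak payload of any one sync is ~1/4 of the model here, not all of it.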
1. partial synchronization
2. communication overlapping with computation
3. quantized communication
This is a distributed free lunch 🥪
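to make the control flow concrete, here's a toy numpy run of the three ingredients together (everything invented: toy quadratic "model", plain SGD inner and outer, absmax 4-bit quantizer; the real recipe differs in the optimizers and many details):

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim, n_frag, H = 4, 12, 3, 30            # replicas, params, fragments, sync period
inner_lr, outer_lr = 0.05, 0.7
frags = np.array_split(np.arange(dim), n_frag)
offset = H // n_frag                        # 1) partial sync: stagger the fragments

targets = rng.normal(size=(K, dim))         # each replica trains on different "data"
x0 = rng.normal(size=dim)
replicas = [x0.copy() for _ in range(K)]
anchors = [x0.copy() for _ in range(K)]     # params at each fragment's last sync

def q4(x):                                  # 3) quantized communication (4-bit absmax)
    s = np.abs(x).max() / 7 + 1e-12
    return np.clip(np.round(x / s), -7, 7) * s

for step in range(1, 10 * H + 1):
    for k in range(K):                      # inner optimization, no communication
        replicas[k] -= inner_lr * (replicas[k] - targets[k])
    for f, idx in enumerate(frags):
        if step % H != (f * offset) % H:
            continue
        # 2) in the real system this all-reduce overlaps with the next inner
        #    steps and its result is applied a few steps late
        outer = np.mean([q4(anchors[k][idx] - replicas[k][idx]) for k in range(K)], axis=0)
        for k in range(K):
            anchors[k][idx] = anchors[k][idx] - outer_lr * outer
            replicas[k][idx] = anchors[k][idx]

print(np.round(np.mean(replicas, axis=0), 2))   # replicas end up close to the targets' mean
```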
DiLoCo (arxiv.org/abs/2311.08105 ) synchronizes less often --> amortizing the cost
Later, @PrimeIntellect released Intellect-1, a repro of DiLoCo, with a 10B model! arxiv.org/abs/2412.01152
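for reference, the original (non-streaming) DiLoCo loop boils down to "many cheap local steps, then one rare outer sync". Toy sketch (toy quadratic, plain SGD inside and out here; iirc the paper uses AdamW inner / Nesterov-momentum outer optimizers):

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim, H = 4, 8, 500                     # replicas, params, inner steps between syncs
x0 = rng.normal(size=dim)
targets = rng.normal(size=(K, dim))       # heterogeneous local data
replicas = [x0.copy() for _ in range(K)]

for outer_step in range(5):
    for k in range(K):                    # H local steps, zero communication
        for _ in range(H):
            replicas[k] -= 0.05 * (replicas[k] - targets[k])
    deltas = [x0 - r for r in replicas]   # "outer gradients"
    x0 = x0 - 0.7 * np.mean(deltas, axis=0)   # one rare all-reduce + outer step
    replicas = [x0.copy() for _ in range(K)]

print(np.round(x0, 2))                    # ends up close to the mean of the targets
```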
however, it's hard to have all that compute available in a single place. Can we distribute it across the world?