John Zila
@jzila.com
So although this predicts improved compute efficiency for training and inference, especially as reasoning-model distillation truly achieves critical mass, we still need best-in-class integrated hardware for compute and memory. NVIDIA is at the forefront here with its superchip architecture.
January 27, 2025 at 8:37 PM
Although DeepSeek V3 and R1 use only 37B active parameters (which is VRAM-efficient for single-token inference), the entire 671B-parameter model must still be loaded into system memory, and swapping experts in and out for each 1-2 token sequence incurs a latency cost.
January 27, 2025 at 8:35 PM
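
A rough sketch of that memory gap (the 671B and 37B parameter counts are from the post above; the FP8 bytes-per-parameter figure is my illustrative assumption, not a real deployment config):

    # Back-of-envelope for DeepSeek V3/R1 weight memory.
    # Parameter counts come from the post above; bytes-per-parameter (FP8)
    # is an illustrative assumption.
    TOTAL_PARAMS = 671e9     # full MoE model
    ACTIVE_PARAMS = 37e9     # parameters activated per token
    BYTES_PER_PARAM = 1      # assume FP8 weights

    total_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
    active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9

    print(f"Weights that must stay resident: ~{total_gb:.0f} GB")
    print(f"Weights touched per token:       ~{active_gb:.0f} GB")
    # The gap between these two numbers is why the full model sits in system
    # memory and experts get swapped in, paying latency on each short sequence.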
405^2 / (37 * 671) ~ 6.6, which accounts for more than half of the ~11.8x compute savings for training DeepSeek V3 compared to Llama 3.1 405b.
January 27, 2025 at 8:29 PM
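
Spelled out under the thread's rough model (dense compute ~ n^2, MoE compute ~ active x total parameters), with the ~11.8x figure coming from the GPU-hour post below:

    # Rough model from this thread: dense compute ~ n^2, MoE ~ active * total.
    dense_llama = 405 ** 2       # Llama 3.1 405b (dense)
    moe_deepseek = 37 * 671      # DeepSeek V3: 37B active of 671B total

    predicted_ratio = dense_llama / moe_deepseek
    print(f"Predicted compute ratio: {predicted_ratio:.1f}x")  # ~6.6x
    # The observed GPU-hour ratio is ~11.8x (post below), so the active-parameter
    # count alone explains more than half of the savings under this rough model.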
From [1], it's reported that Llama 3.1 405b took 30.8M GPU hours to train. DeepSeek V3 took 2.6M GPU hours, for a ratio of about 11.8.

Model compute needs scale roughly O(n^2) with parameter count, but MoE models use far fewer active parameters per token.

[1] DeepSeek V3 and the cost of frontier AI models, www.interconnects.ai/p/deepseek-v...
January 27, 2025 at 8:28 PM
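
The ratio above, spelled out (GPU-hour figures as reported in [1]; this is just the division):

    llama_gpu_hours = 30.8e6     # Llama 3.1 405b, per [1]
    deepseek_gpu_hours = 2.6e6   # DeepSeek V3, per [1]
    print(f"Training-compute ratio: {llama_gpu_hours / deepseek_gpu_hours:.1f}x")  # ~11.8x
    # This is the end-to-end ratio that the parameter-count argument above
    # partially explains.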