Lightnews — Scholar-powered news

John Zila

@jzila.com


John Zila

@jzila.com

So although this predicts improved compute efficiency for training and inference, especially once reasoning-model distillation reaches critical mass, we still need best-in-class integrated hardware for compute and memory. NVIDIA is at the forefront here with its superchip architecture.

January 27, 2025 at 8:37 PM

John Zila

@jzila.com

Although DeepSeek V3 and R1 use only 37B active parameters per token (which is VRAM-efficient for single-token inference), the entire 671B-parameter model must still be loaded into system memory, and swapping experts every 1-2 tokens incurs a latency cost.
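A rough sketch of the memory gap described above, counting weights only. The bytes-per-parameter values are assumptions for illustration (1 for an 8-bit format, 2 for a 16-bit format), and the helper name `weight_gb` is hypothetical. KV cache, activations, and runtime overhead are ignored.

```python
# Weights-only memory estimate. Parameter counts come from the post;
# bytes_per_param is an ASSUMPTION (1 = 8-bit weights, 2 = 16-bit weights).

TOTAL_PARAMS_B = 671   # total parameters, in billions
ACTIVE_PARAMS_B = 37   # active parameters per token, in billions


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    # 1e9 params * bytes_per_param / 1e9 bytes per GB cancels to a plain product.
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for bytes_per_param in (1, 2):
        total = weight_gb(TOTAL_PARAMS_B, bytes_per_param)
        active = weight_gb(ACTIVE_PARAMS_B, bytes_per_param)
        print(f"{bytes_per_param} byte(s)/param: "
              f"all experts {total:.0f} GB, active per token {active:.0f} GB")
```

Under either assumption the full set of experts is about 18x larger than the per-token active set (671 / 37), which is the gap that has to sit in system memory.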

January 27, 2025 at 8:35 PM

John Zila

@jzila.com

405^2 / (37 * 671) ~ 6.6, which accounts for more than half of the compute savings from training DeepSeek V3 compared to Llama 3.1 405B.
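The arithmetic in this thread can be checked with a short script. This is only a sketch: it uses the GPU-hour and parameter figures quoted in the posts, and it takes the thread's O(n^2) scaling heuristic as given rather than as an established law. The function names are hypothetical.

```python
# Figures quoted in the thread (GPU hours and parameter counts in billions).
LLAMA_GPU_HOURS = 30.8e6
DEEPSEEK_GPU_HOURS = 2.6e6
LLAMA_PARAMS_B = 405
DEEPSEEK_ACTIVE_B = 37
DEEPSEEK_TOTAL_B = 671


def gpu_hour_ratio() -> float:
    """Observed ratio of reported training GPU hours (Llama / DeepSeek)."""
    return LLAMA_GPU_HOURS / DEEPSEEK_GPU_HOURS


def param_ratio() -> float:
    """Ratio from the thread's heuristic: dense n^2 vs. active * total for MoE."""
    return LLAMA_PARAMS_B ** 2 / (DEEPSEEK_ACTIVE_B * DEEPSEEK_TOTAL_B)


if __name__ == "__main__":
    observed = gpu_hour_ratio()
    predicted = param_ratio()
    print(f"observed GPU-hour ratio:    {observed:.2f}")
    print(f"parameter-based ratio:      {predicted:.2f}")
    print(f"share of observed ratio:    {predicted / observed:.2f}")
    print(f"remaining unexplained factor: {observed / predicted:.2f}")
```

The parameter-based ratio comes out above half of the observed ratio, leaving a factor a little under 2 to be explained by something else.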

January 27, 2025 at 8:29 PM

John Zila

@jzila.com

According to [1], Llama 3.1 405B took 30.8M GPU hours to train. DeepSeek V3 took 2.6M GPU hours, for a ratio of about 11.8.

Training compute scales as O(n^2) in the parameter count n, but MoE models use far fewer active parameters.

[1] www.interconnects.ai/p/deepseek-v...

DeepSeek V3 and the cost of frontier AI models

The $5M figure for the last training run should not be your basis for how much frontier AI models cost.

www.interconnects.ai

January 27, 2025 at 8:28 PM
