Superhero @jonasgeiping.bsky.social started architecture search for this two years ago, & wrote a distributed framework from scratch to work around bugs on AMD hardware 🤯
Still, Huginn-3.5B beats OLMo-7B-0724 (with CoT) at GSM8K by a wide margin (42% vs. 29%).
Huginn has half the parameters, 1/3 the training tokens, no explicit fine-tuning, and the LR was never annealed.
Latent reasoning still wins.
We train a standard 3.5B LLM from scratch on 180B tokens. Then we train a recurrent 3.5B model on the same tokens.
The recurrent model does 5X better on GSM8K.
We built our reasoning system by putting a recurrent block inside the LLM. On a forward pass, we loop this block a random number of times. By looping it more times, we dial up compute.
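Here's a minimal PyTorch sketch of the looped-block idea — an illustration of the structure, not the actual Huginn code. The layer makeup, the loop-count range, and the way the latent state and embedded input get combined are all simplifying assumptions on my part.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Sketch: a prelude, a recurrent core that gets looped, and a coda."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True
        )
        self.prelude = make_layer()  # embeds the input into latent space
        self.core = make_layer()     # the block that gets looped
        self.coda = make_layer()     # maps the final latent state to outputs

    def forward(self, x, num_loops=None):
        if num_loops is None:
            # Train time: sample the loop count at random so the model
            # learns to yield a usable state at any recurrence depth.
            num_loops = int(torch.randint(1, 33, (1,)))
        h = self.prelude(x)
        s = torch.randn_like(h)  # random initial latent state
        for _ in range(num_loops):
            # Re-inject the embedded input at every step (by addition here;
            # an assumption — the real model may combine them differently).
            s = self.core(s + h)
        return self.coda(s)

# More loops = more compute at inference time, same weights:
model = RecurrentDepthLM()
x = torch.randn(1, 16, 512)  # (batch, sequence, d_model)
fast, slow = model(x, num_loops=4), model(x, num_loops=64)
```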
With this throughput, DeepSeek’s 2.664M GPU-hour pre-training run would rip through 14.3T tokens. DeepSeek claims to have trained on 14.8T tokens.
This part checks out...but only with killer engineering.
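For anyone who wants to redo the arithmetic, a quick sketch using the figures quoted above. The implied per-GPU token rate is my back-calculation from those numbers, not anything DeepSeek published.

```python
# Back-of-the-envelope: what per-GPU token rate does the claim imply?
gpu_hours = 2.664e6        # DeepSeek's reported pre-training GPU-hours
claimed_tokens = 14.8e12   # DeepSeek's reported pre-training tokens

tokens_per_gpu_second = claimed_tokens / (gpu_hours * 3600)
print(f"{tokens_per_gpu_second:,.0f} tokens/s per GPU")  # ≈ 1,543 tokens/s
```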