This is the "Qwen2.5 3B-base" model trained for 1000 RL steps only on the CountDown task with a correctness reward.
Checkpoint at huggingface.co/McGill-NLP/n...
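For readers unfamiliar with the setup, here is a minimal sketch of what a correctness reward on CountDown could look like. It assumes the usual form of the task (combine the given numbers with +, -, *, / to hit a target, using each number exactly once) and a binary 0/1 reward. The function names `correctness_reward` and `_evaluate` are hypothetical and are not taken from the repo.

```python
import ast
import operator
from collections import Counter

# Binary operators allowed in an answer (an assumption for this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording every integer literal in `used`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    # Names, calls, unary operators, etc. are rejected.
    raise ValueError("unsupported expression")


def correctness_reward(expression, numbers, target):
    """Hypothetical reward: 1.0 if `expression` uses each number exactly once and equals `target`, else 0.0."""
    used = []
    try:
        value = _evaluate(ast.parse(expression.strip(), mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) != Counter(numbers):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Usage: `correctness_reward("(25 - 3) * 4", [3, 4, 25], 88)` scores a correct answer, while an expression that skips a number or misses the target scores zero. Parsing with `ast` instead of calling `eval` keeps arbitrary model output from being executed.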
github.com/McGill-NLP/n...
YouTube Video:
www.youtube.com/playlist?lis...
And yes, we recreated DeepSeek R1-Zero-style training on CountDown in ~10h with one A100.
- super hackable
- no TRL / Verl, no abstraction 💆‍♂️
- Single GPU, full param tuning, 3B LLM
- Efficient (R1-Zero CountDown < 10h)
Comes with a from-scratch, fully spelled-out YT video [1/n]