The data is also worth a look, as it shows how LM Arena results can be manipulated to be more pleasing to humans. t.co/rqAey9SMwh
Always test on full-weight, non-distilled DeepSeek.
The top 4 in this post have worked well for me.
aider.chat/2025/01/28/d...
... and it's currently free. 🤯
- RL generalizes in rule-based envs, esp. when trained with an outcome-based reward
- SFT tends to memorize the training data and struggles to generalize OOD