mlnews.bsky.social
@mlnews.bsky.social
Reinforcement Learning via Self-Distillation

introduces SDPO, which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model.

arxiv.org/abs/2601.20802
February 2, 2026 at 8:53 AM
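The post doesn't spell out the mechanism, but "self-distillation ... without any external teacher" suggests the model's own feedback-conditioned distribution acts as the teacher. Below is a minimal PyTorch sketch under that assumption; the TinyLM, the sdpo_step name, and the per-token KL objective are illustrative stand-ins, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy causal LM standing in for the policy; returns logits [B, T, V]."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)

def sdpo_step(model, prompt, feedback, response, opt):
    """One self-distillation step: the same model plays teacher and student."""
    R = response.size(-1)
    # Teacher view (assumed): the model conditioned on prompt + feedback, frozen.
    with torch.no_grad():
        t_logits = model(torch.cat([prompt, feedback, response], -1))[:, -R - 1:-1]
    # Student view: the model conditioned on the prompt alone.
    s_logits = model(torch.cat([prompt, response], -1))[:, -R - 1:-1]
    # Dense signal: a KL term at every response token, not one scalar reward.
    loss = F.kl_div(F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ids = lambda n: torch.randint(0, 256, (1, n))
print(sdpo_step(model, prompt=ids(16), feedback=ids(8), response=ids(32), opt=opt))
```

The per-token KL is what would make the signal dense: every response token gets a gradient, instead of a single scalar reward at the end of the sequence.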
New paper "Teaching Models to Teach Themselves" introduces SOAR. It uses meta-RL to let a 'teacher' model generate stepping-stone problems for a 'student' to solve.

Surprising result: The teacher doesn't need to know the final answer to guide the student effectively. arxiv.org/abs/2601.18778
January 29, 2026 at 12:36 PM
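One reading that would explain the surprising result: if the teacher is rewarded for the student's learning progress rather than for answer correctness, it never needs the final answer. Here is a toy, bandit-style stand-in for the meta-RL loop under that assumption; every number and name below is illustrative, not from the paper.

```python
import random

def success_rate(skill, difficulty, trials=200):
    """Toy student: solve probability rises with skill, falls with difficulty."""
    p = min(1.0, max(0.0, skill - difficulty + 0.5))
    return sum(random.random() < p for _ in range(trials)) / trials

skill = 0.2        # the student's current ability
difficulty = 0.1   # the teacher's "action": how hard a stepping stone to pose
for step in range(50):
    before = success_rate(skill, 1.0)      # evaluate on the hard target task
    # The student trains on the teacher's stepping-stone problem; it only
    # improves when the problem is solvable at its current level.
    skill += 0.02 * success_rate(skill, difficulty)
    after = success_rate(skill, 1.0)
    progress = after - before              # the teacher's ONLY reward signal
    # Meta-update: ratchet difficulty toward whatever keeps progress positive.
    difficulty += 0.05 if progress > 0 else -0.05
    difficulty = min(1.0, max(0.0, difficulty))

print(f"student success on the hard target: {success_rate(skill, 1.0):.2f}")
```

Note that the teacher's reward is purely `after - before`, so the correctness of any individual stepping-stone answer never enters the loop.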
Ministral 3 was introduced on January 13:

a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three sizes: 3B, 8B, and 14B. The authors present a recipe to derive the models through Cascade Distillation. arxiv.org/pdf/2601.08584
January 29, 2026 at 12:31 PM
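The post doesn't define Cascade Distillation; a natural reading (assumed here, not stated) is that each smaller model is distilled from the next-larger one in sequence, 14B → 8B → 3B, rather than every size from one big teacher. A toy sketch of that recipe, with small MLPs standing in for the three model sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width):
    """Toy classifier standing in for a dense LM of a given size."""
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))

def distill(teacher, student, steps=200, temp=2.0):
    """Standard logit distillation for one stage of the cascade."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(64, 32)                       # stand-in transfer set
        with torch.no_grad():
            t = F.log_softmax(teacher(x) / temp, -1)  # softened teacher targets
        s = F.log_softmax(student(x) / temp, -1)
        loss = F.kl_div(s, t, log_target=True, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# The cascade: each stage's student becomes the next, smaller stage's teacher.
models = [make_model(w) for w in (512, 256, 128)]  # "14B" -> "8B" -> "3B" stand-ins
for teacher, student in zip(models, models[1:]):
    distill(teacher, student)
```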