introduces SDPO, which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model.
arxiv.org/abs/2601.20802
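The blurb gives no mechanism, so here is a loudly hypothetical sketch of what a "dense learning signal from tokenized feedback" could look like (the function name, weighting scheme, and loss form are all assumptions, not taken from the paper): per-token feedback weights turn a single sequence-level score into a weighted negative log-likelihood where every token carries its own signal.

```python
# Hypothetical sketch, not the paper's actual SDPO objective:
# each token receives its own feedback weight, so the loss is a
# weighted negative log-likelihood rather than one scalar per sequence.

def dense_loss(token_logprobs, feedback_weights):
    """Weighted NLL: each token contributes in proportion to its feedback."""
    assert len(token_logprobs) == len(feedback_weights)
    n = len(token_logprobs)
    return -sum(w * lp for lp, w in zip(token_logprobs, feedback_weights)) / n

# Example: tokens flagged by feedback (weight 1.0) drive the loss,
# tokens with no feedback (weight 0.0) are ignored.
logprobs = [-0.1, -2.3, -0.5, -1.7]
weights = [0.0, 1.0, 0.0, 1.0]
loss = dense_loss(logprobs, weights)
```

The point of the sketch is only the density: unlike a sequence-level reward, every position can receive gradient signal independently.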
Surprising result: The teacher doesn't need to know the final answer to guide the student effectively. arxiv.org/abs/2601.18778
a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three sizes: 3B, 8B, and 14B. The authors present a recipe for deriving the models through Cascade Distillation. arxiv.org/pdf/2601.08584
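The recipe itself isn't described in the blurb; a plausible reading of "Cascade Distillation" (an assumption on my part, not the paper's stated method) is that each smaller model is distilled from the next larger one in sequence, 14B → 8B → 3B, rather than all three from one teacher. A toy sketch with plain probability vectors:

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions (teacher p, student q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(teacher_probs, student_probs, lr=0.5):
    """Toy distillation update: mix the student distribution toward the teacher."""
    mixed = [(1 - lr) * s + lr * t for s, t in zip(student_probs, teacher_probs)]
    z = sum(mixed)
    return [m / z for m in mixed]

# Hypothetical cascade order: each size distills from the one above it.
sizes = ["14B", "8B", "3B"]
dist = {
    "14B": [0.7, 0.2, 0.1],  # largest model acts as the first teacher
    "8B": [0.4, 0.4, 0.2],
    "3B": [0.3, 0.3, 0.4],
}
for teacher, student in zip(sizes, sizes[1:]):
    dist[student] = distill_step(dist[teacher], dist[student])
```

After the loop, the 8B distribution has moved toward the 14B teacher, and the 3B toward the already-distilled 8B, which is the defining property of a cascade as opposed to one shared teacher.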