Dan Busbridge
@dbusbridge.bsky.social
Machine Learning Research @ Apple (opinions are my own)
Pinned
Dan Busbridge
@dbusbridge.bsky.social
· Feb 13
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:
"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"
Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606
Distillation Scaling Laws
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated ...
arxiv.org
February 13, 2025 at 9:50 PM
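For readers unfamiliar with the starting point, below is a minimal sketch of the knowledge-distillation objective from Hinton et al.'s "Distilling the Knowledge in a Neural Network" referenced above: a student is trained on a blend of temperature-softened teacher outputs and hard labels. This is illustrative only; the names (student_logits, teacher_logits, temperature, alpha) are assumptions, and it is not the scaling-law fitting code from the paper linked in the post.

```python
# Sketch of the classic distillation loss (Hinton et al.), assuming PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target matching and ordinary cross-entropy.

    alpha weights the soft-target (distillation) term against the
    hard-label term; temperature softens both distributions.
    """
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures,
    # as suggested in the original paper.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard targets: standard cross-entropy against ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example usage with random tensors (batch of 4, 10 classes).
if __name__ == "__main__":
    student_logits = torch.randn(4, 10)
    teacher_logits = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    print(distillation_loss(student_logits, teacher_logits, labels))
```

The question the thread asks is when training a student this way beats simply training it from scratch for the same compute, which is what the distillation scaling law in the linked paper is built to answer.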
Several people have asked me to comment further on the connection between our work and the Patient and Consistent Teachers study by Beyer et al., since in Section 5.1 we note that our findings appear to contradict theirs.
arxiv.org/abs/2106.05237
February 13, 2025 at 9:52 PM
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:
"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"
Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606
"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"
Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606
Distillation Scaling Laws
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated ...
arxiv.org
February 13, 2025 at 9:50 PM
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:
"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"
Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606
"If I want a small, capable model, should I distill from a more powerful model, or train from scratch?"
Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606