10/n
9/n
1) More sophisticated routing mechanisms may further boost performance.
2) Better methods for reducing student size will likely improve quality and reduce latency.
3) Sharing/interaction schemes among students may enhance training efficiency.
8/n
7/n
6/n
5/n
1) At training time, MSD partitions the dataset and assigns each partition to a different student;
2) At inference time, MSD uses only one student.
4/n
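The two-phase scheme above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the round-robin partitioning and hash-based routing are stand-in assumptions for whatever partition/routing scheme MSD actually uses.

```python
# Hypothetical sketch of the MSD train/infer split; names and the
# round-robin / hash choices are illustrative, not from the paper.

def partition(dataset, num_students):
    """Training time: split the data into disjoint per-student subsets."""
    return [dataset[i::num_students] for i in range(num_students)]

def route(example, num_students):
    """Inference time: pick the single student responsible for this input."""
    return hash(example) % num_students

data = list(range(12))
parts = partition(data, 3)
# Each example lands in exactly one student's training subset.
assert sorted(x for p in parts for x in p) == data

# At inference time only the one routed student runs.
student_idx = route("some prompt", 3)
assert 0 <= student_idx < 3
```

Because each student only ever sees (and serves) its own slice of the data, it can be made smaller than a single monolithic student covering everything.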
It also improves latency by enabling smaller students, each focusing its capacity on a different subset of the data.
3/n
Paper: arxiv.org/abs/2410.23274
w/ @jonLorraine9, Weili Nie, Karsten Kreis, James Lucas
Supported (indirectly) by Harvard Statistics Department, @vectorinst.bsky.social, Department of Computer Science, University of Toronto
2/n