Is it a way to smuggle more computation into a smaller model without looking at the data many more times?
We pondered this over the summer and developed a new model: JetFormer 🌊🤖
arxiv.org/abs/2411.19722
A thread 👇
1/