At 1.23B params, the perplexity gap between ByT5 and MrT5 shrinks dramatically, suggesting that MrT5’s deletion mechanism scales effectively with model size.
This means: better efficiency–performance trade-offs in high-resource settings.
In the final version, we include:
- A new controller algorithm for targeted compression rates
- More baselines and downstream tasks
- MrT5 at 1.23B parameter scale
🪧 Poster: 10–12:30 in Hall 3 + 2B (#273)
⚡️ Lightning talk: right after in Opal 103–104 (Session on Tokenizer-Free, End-to-end Architectures)
Plus, MrT5 has many exciting updates 🧵