🎤 “Adaptive Units of Computation: Towards Sublinear-Memory and Tokenizer-Free Foundation Models”
Fascinating glimpse into the next gen of foundation models.
#FoundationModels #NLP #TokenizerFree #ADSAI2025
It was amazing to spend a year at NVIDIA as a visiting professor!
arXiv: arxiv.org/pdf/2506.05345
Code and models coming soon!
This allows LLMs to preserve information while reducing latency and memory footprint.
Enter Dynamic Memory Sparsification (DMS), which achieves 8x KV cache compression with 1K training steps and retains accuracy better than SOTA methods.
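As a rough illustration of what KV cache compression during decoding means in general, here is a toy sketch: keep only the most "important" cached key/value pairs under a fixed budget. This is not the DMS algorithm from the paper; the importance score and shapes below are placeholders.

```python
# Toy illustration of KV cache compression (NOT the DMS algorithm):
# keep only the top-scoring cached key/value pairs under a fixed budget.
import numpy as np

def compress_kv_cache(keys, values, importance, ratio=8):
    """keys/values: (seq_len, head_dim); importance: (seq_len,) score per cached token."""
    keep = max(1, keys.shape[0] // ratio)             # 8x compression -> keep 1/8
    kept = np.sort(np.argsort(importance)[-keep:])    # most important tokens, in order
    return keys[kept], values[kept]

# Usage: 64 cached tokens compressed to 8
rng = np.random.default_rng(0)
k, v = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
score = rng.random(64)                                # placeholder importance score
k_small, v_small = compress_kv_cache(k, v, score)
print(k_small.shape)                                  # (8, 16)
```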
Paper: arxiv.org/abs/2504.17768
Thanks to the lead author, Piotr Nawrot, and all the amazing collaborators!
Our insights demonstrate that sparse attention will play a key role in next-generation foundation models.
However, on average, Vertical-Slash is the most competitive for prefilling and Quest for decoding; context-aware and highly adaptive variants are preferable.
Importantly, for most settings there is at least one degraded task, even at moderate compression ratios (<5x).
This suggests a strategy shift where scaling up model size must be combined with sparse attention to achieve an optimal trade-off.
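To make "sparse attention for decoding" concrete, here is a generic sketch of query-aware page selection in the spirit of Quest: each page of the KV cache keeps per-dimension min/max key statistics, and a query attends only to the pages with the highest upper-bound scores. The page size, budget, and bound below are illustrative assumptions, not Quest's exact implementation.

```python
# Generic sketch of query-aware page selection (in the spirit of Quest,
# not its exact implementation).
import numpy as np

def select_pages(query, keys, page_size=16, top_k_pages=2):
    """query: (d,), keys: (seq_len, d). Returns indices of tokens to attend to."""
    n_pages = int(np.ceil(keys.shape[0] / page_size))
    scores = []
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        kmin, kmax = page.min(axis=0), page.max(axis=0)
        # Upper bound on q . k over all keys in this page
        scores.append(np.maximum(query * kmin, query * kmax).sum())
    best = sorted(np.argsort(scores)[-top_k_pages:])      # highest-scoring pages
    return np.concatenate([np.arange(p * page_size,
                                     min((p + 1) * page_size, keys.shape[0]))
                           for p in best])

# Usage: attend to 2 of 8 pages (32 of 128 cached tokens)
rng = np.random.default_rng(0)
q, K = rng.normal(size=64), rng.normal(size=(128, 64))
print(select_pages(q, K).shape)                           # (32,)
```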
- Richer proxies for meaning, including a temporal dimension and internal agent states
- The study of grammaticalization through the lens of groundedness
We release an extensive dataset to support these studies: osf.io/bdhna/
- follows a continuous cline cross-linguistically: nouns > adjectives > verbs
- is non-zero even for functional classes (e.g., adpositions)
- is contextual, so agrees with psycholinguistic norms only in part
Their difference (pointwise mutual information) corresponds to the groundedness of a word: the surprisal that remains once its function is known.
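In symbols, my reading of this (the paper's exact notation may differ; w, f, and g are my labels for the word, its function, and the grounding context):

```latex
% Groundedness as the surprisal reduction due to grounding, i.e. conditional PMI.
% w = word, f = its function, g = grounding (extralinguistic) context.
\mathrm{groundedness}(w)
  = -\log p(w \mid f) - \bigl(-\log p(w \mid f, g)\bigr)
  = \log \frac{p(w \mid f, g)}{p(w \mid f)}
  = \mathrm{PMI}(w; g \mid f)
```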
1) Reusing / interpolating old token embeddings is reminiscent of our FOCUS baseline. Unfortunately, it degrades performance, as even identical tokens may change their function.
2) You incur a large overhead from computing the co-occurrence matrix for every new tokenizer.
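For context, the kind of reuse/interpolation in question initializes each embedding of a new tokenizer from the old one: copy tokens that already exist, average over constituent pieces otherwise. A minimal sketch of that general idea (not the exact FOCUS procedure; the piece-averaging rule and fallback are illustrative assumptions):

```python
# Minimal sketch of initializing a new tokenizer's embeddings from an old one
# (the general idea the reply refers to; NOT the exact FOCUS procedure).
import numpy as np

def init_new_embeddings(old_vocab, old_emb, new_vocab, old_tokenize):
    """old_vocab: token -> row index in old_emb; old_tokenize: str -> list of old tokens."""
    new_emb = np.zeros((len(new_vocab), old_emb.shape[1]))
    for i, tok in enumerate(new_vocab):
        if tok in old_vocab:                              # identical token: copy as-is
            new_emb[i] = old_emb[old_vocab[tok]]
        else:                                             # new token: mean of its pieces
            pieces = [old_vocab[p] for p in old_tokenize(tok) if p in old_vocab]
            new_emb[i] = old_emb[pieces].mean(axis=0) if pieces else old_emb.mean(axis=0)
    return new_emb
```

The caveat in 1) hits exactly the copy branch: a token that survives verbatim into the new vocabulary may still change its function, so reusing its old embedding unchanged is not always safe.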