🎤 “Adaptive Units of Computation: Towards Sublinear-Memory and Tokenizer-Free Foundation Models”
Fascinating glimpse into the next gen of foundation models.
#FoundationModels #NLP #TokenizerFree #ADSAI2025
It was amazing to spend a year at NVIDIA as a visiting professor!
arXiv: arxiv.org/pdf/2506.05345
Code and models coming soon!
This allows LLMs to preserve information while reducing latency and memory footprint.
Enter Dynamic Memory Sparsification (DMS), which achieves 8x KV cache compression with 1K training steps and retains accuracy better than SOTA methods.
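As a rough illustration of what KV cache compression during decoding means in general, here is a toy sketch: keep only the most "important" cached key/value pairs under a fixed budget. This is not the DMS algorithm from the paper; the importance score and shapes below are placeholders.

```python
# Toy illustration of KV cache compression (NOT the DMS algorithm):
# keep only the top-scoring cached key/value pairs under a fixed budget.
import numpy as np

def compress_kv_cache(keys, values, importance, ratio=8):
    """keys/values: (seq_len, head_dim); importance: (seq_len,) score per cached token."""
    keep = max(1, keys.shape[0] // ratio)             # 8x compression -> keep 1/8
    kept = np.sort(np.argsort(importance)[-keep:])    # most important tokens, in order
    return keys[kept], values[kept]

# Usage: 64 cached tokens compressed to 8
rng = np.random.default_rng(0)
k, v = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
score = rng.random(64)                                # placeholder importance score
k_small, v_small = compress_kv_cache(k, v, score)
print(k_small.shape)                                  # (8, 16)
```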
Paper: arxiv.org/abs/2504.17768
Thanks to the lead author, Piotr Nawrot, and all the amazing collaborators!
Our insights demonstrate that sparse attention will play a key role in next-generation foundation models.
However, on average, Vertical-Slash is the most competitive for prefilling and Quest for decoding; context-aware and highly adaptive variants are preferable.
Importantly, for most settings there is at least one degraded task, even at moderate compression ratios (<5x).
This suggests a strategy shift where scaling up model size must be combined with sparse attention to achieve an optimal trade-off.
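To make "sparse attention for decoding" concrete, here is a generic sketch of query-aware page selection in the spirit of Quest: each page of the KV cache keeps per-dimension min/max key statistics, and a query attends only to the pages with the highest upper-bound scores. The page size, budget, and bound below are illustrative assumptions, not Quest's exact implementation.

```python
# Generic sketch of query-aware page selection (in the spirit of Quest,
# not its exact implementation).
import numpy as np

def select_pages(query, keys, page_size=16, top_k_pages=2):
    """query: (d,), keys: (seq_len, d). Returns indices of tokens to attend to."""
    n_pages = int(np.ceil(keys.shape[0] / page_size))
    scores = []
    for p in range(n_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        kmin, kmax = page.min(axis=0), page.max(axis=0)
        # Upper bound on q . k over all keys in this page
        scores.append(np.maximum(query * kmin, query * kmax).sum())
    best = sorted(np.argsort(scores)[-top_k_pages:])      # highest-scoring pages
    return np.concatenate([np.arange(p * page_size,
                                     min((p + 1) * page_size, keys.shape[0]))
                           for p in best])

# Usage: attend to 2 of 8 pages (32 of 128 cached tokens)
rng = np.random.default_rng(0)
q, K = rng.normal(size=64), rng.normal(size=(128, 64))
print(select_pages(q, K).shape)                           # (32,)
```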
- Richer proxies for meaning, including a temporal dimension and internal agent states
- The study of grammaticalization through the lens of groundedness
We release an extensive dataset to support these studies: osf.io/bdhna/
- follows a continuous cline cross-linguistically: nouns > adjectives > verbs
- is non-zero even for functional classes (e.g., adpositions)
- is contextual, so agrees with psycholinguistic norms only in part
Their difference (pointwise mutual information) corresponds to the groundedness of a word: the surprisal that remains once its function is known.
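In symbols, my reading of this (the paper's exact notation may differ; w, f, and g are my labels for the word, its function, and the grounding context):

```latex
% Groundedness as the surprisal reduction due to grounding, i.e. conditional PMI.
% w = word, f = its function, g = grounding (extralinguistic) context.
\mathrm{groundedness}(w)
  = -\log p(w \mid f) - \bigl(-\log p(w \mid f, g)\bigr)
  = \log \frac{p(w \mid f, g)}{p(w \mid f)}
  = \mathrm{PMI}(w; g \mid f)
```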
1) Reusing / interpolating old token embeddings is reminiscent of our FOCUS baseline. Unfortunately, it degrades performance, as even identical tokens may change their function.
2) You incur a large overhead from computing the co-occurrence matrix for every new tokenizer.
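For context, the kind of reuse/interpolation in question initializes each embedding of a new tokenizer from the old one: copy tokens that already exist, average over constituent pieces otherwise. A minimal sketch of that general idea (not the exact FOCUS procedure; the piece-averaging rule and fallback are illustrative assumptions):

```python
# Minimal sketch of initializing a new tokenizer's embeddings from an old one
# (the general idea the reply refers to; NOT the exact FOCUS procedure).
import numpy as np

def init_new_embeddings(old_vocab, old_emb, new_vocab, old_tokenize):
    """old_vocab: token -> row index in old_emb; old_tokenize: str -> list of old tokens."""
    new_emb = np.zeros((len(new_vocab), old_emb.shape[1]))
    for i, tok in enumerate(new_vocab):
        if tok in old_vocab:                              # identical token: copy as-is
            new_emb[i] = old_emb[old_vocab[tok]]
        else:                                             # new token: mean of its pieces
            pieces = [old_vocab[p] for p in old_tokenize(tok) if p in old_vocab]
            new_emb[i] = old_emb[pieces].mean(axis=0) if pieces else old_emb.mean(axis=0)
    return new_emb
```

The caveat in 1) hits exactly the copy branch: a token that survives verbatim into the new vocabulary may still change its function, so reusing its old embedding unchanged is not always safe.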