Gabriele Sarti
Gabriele Sarti
@gsarti.com
PhD Student at @gronlp.bsky.social 🐮, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc.

gsarti.com
Pinned
I've decided to start a book thread for 2025 to share cool books and stay focused on my reading goals. Here we go! 📚
Reposted by Gabriele Sarti
Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.

Our Temporal Feature Analyzer discovers contextual features in LLMs that detect event boundaries, parse complex grammar, and represent ICL patterns.
November 13, 2025 at 10:32 PM
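For context, a sparse autoencoder (SAE) in this line of work learns a dictionary of features from per-token activations. A minimal sketch of the encode/decode step, purely illustrative and not the Temporal Feature Analyzer itself (dimensions and hyperparameters are made up):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy per-token SAE (illustrative only, not the paper's model)."""
    def __init__(self, d_model=768, d_dict=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations per token
        recon = self.decoder(feats)             # reconstruction of the input activations
        return feats, recon

sae = SparseAutoencoder()
acts = torch.randn(10, 768)  # 10 token positions (toy data)
feats, recon = sae(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
```

The post's point is that this standard per-token setup tends to surface "fast", local features, while "slow", contextual concepts spanning many tokens can slip through.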
New promising model for interpretability research just dropped!
Through this release, we aim to support the emerging ecosystem for pretraining research (NanoGPT, NanoChat), explainability (you can literally look at Monad under a microscope), and the tooling orchestration around frontier models.
November 10, 2025 at 9:09 PM
Check out our awesome live-skeeted panel!
Our panel moderated by @danaarad.bsky.social
"Evaluating Interpretability Methods: Challenges and Future Directions" just started! 🎉 Come to learn more about the MIB benchmark and hear the takes of @michaelwhanna.bsky.social, Michal Golovanevsky, Nicolò Brunello and Mingyang Wang!
November 9, 2025 at 7:18 AM
Follow @blackboxnlp.bsky.social for a live skeeting of the event!
BlackboxNLP is up and running! Here are the topics covered by this year's edition at a glance. Excited to see so many interesting topics, and the growing interest in reasoning!
November 9, 2025 at 2:20 AM
Wrapping up my oral presentations today with our TACL paper "QE4PE: Quality Estimation for Human Post-editing" at the Interpretability morning session #EMNLP2025 (Room A104, 11:45 China time)!

Paper: arxiv.org/abs/2503.03044
Slides/video/poster: underline.io/lecture/1315...
November 7, 2025 at 2:50 AM
Presenting today our work "Unsupervised Word-level Quality Estimation Through the Lens of Annotator (Dis)agreement" at the Machine Translation morning session (Room A301, 11:45 China time). See you there! 🤗

Paper: aclanthology.org/2025.emnlp-m...
Slides/video/poster: underline.io/events/502/s...
Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
aclanthology.org
November 6, 2025 at 1:19 AM
Reposted by Gabriele Sarti
How can a language model find the veggies in a menu?

New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.

Spoiler: it turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
November 4, 2025 at 5:48 PM
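To unpack the analogy: Python's built-in `filter` applies the same predicate to every element of a list and keeps the matches, which is the behaviour the pre-print attributes to the model's internal mechanism. A toy example (the menu and labels are made up):

```python
# Hypothetical menu; the pre-print studies how LLMs do this filtering internally.
menu = ["steak", "broccoli", "salmon", "spinach", "lentil soup"]
veggie = {"broccoli", "spinach", "lentil soup"}  # toy labels, for illustration only

is_veggie = lambda dish: dish in veggie
print(list(filter(is_veggie, menu)))  # ['broccoli', 'spinach', 'lentil soup']
```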
Reposted by Gabriele Sarti
Language models can correctly answer questions about their previous intentions.
www.anthropic.com/research/int...
Emergent introspective awareness in large language models
Research from Anthropic on the ability of large language models to introspect
www.anthropic.com
October 29, 2025 at 6:21 PM
Reposted by Gabriele Sarti
Can AI simulate human behavior? 🧠
The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality?
To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
October 28, 2025 at 4:54 PM
Our group @gronlp.bsky.social is coming in strong for #EMNLP2025! See you soon in Suzhou! 👋 🇨🇳
With only a week left until #EMNLP2025, we are happy to announce all the works we 🐮 will present 🥳 - come say "hi" to our posters and presentations during the Main conference and the co-located events (*SEM and workshops). See you in Suzhou ✈️
October 28, 2025 at 7:41 AM
Reposted by Gabriele Sarti
You can easily save up to 65% of compute while improving performance on reasoning tasks 🤯 👀

Meet EAGer: We show that monitoring token-level uncertainty lets LLMs allocate compute dynamically - spending MORE on hard problems, LESS on easy ones.
🧵👇
October 16, 2025 at 12:07 PM
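The underlying idea, as I read the thread, is that per-token uncertainty can gate how much extra compute to spend. A minimal sketch of such a policy, with made-up thresholds and budgets that are not EAGer's actual settings:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Shannon entropy of the next-token distribution (in nats)."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

def allocate_samples(logits, base=1, extra=4, threshold=2.0):
    """Spend more generations when the model is uncertain, fewer when it is confident.
    Illustrative policy only; the threshold and budgets are arbitrary numbers."""
    return base + extra if token_entropy(logits).item() > threshold else base

logits = torch.randn(50257)      # toy next-token logits
print(allocate_samples(logits))  # e.g. 5 on an uncertain step, 1 on a confident one
```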
Reposted by Gabriele Sarti
𝐃𝐨 𝐲𝐨𝐮 𝐫𝐞𝐚𝐥𝐥𝐲 𝐰𝐚𝐧𝐭 𝐭𝐨 𝐬𝐞𝐞 𝐰𝐡𝐚𝐭 𝐦𝐮𝐥𝐭𝐢𝐥𝐢𝐧𝐠𝐮𝐚𝐥 𝐞𝐟𝐟𝐨𝐫𝐭 𝐥𝐨𝐨𝐤𝐬 𝐥𝐢𝐤𝐞? 🇨🇳🇮🇩🇸🇪

Here’s the proof! 𝐁𝐚𝐛𝐲𝐁𝐚𝐛𝐞𝐥𝐋𝐌 is the first multilingual benchmark of developmentally plausible training data, now available to the NLP community for 45 languages 🎉

arxiv.org/abs/2510.10159
October 14, 2025 at 5:01 PM
"Assuming linearly encoded concepts"
In honour of spooky month, share a 4 word horror story that only someone in your profession would understand.

rm -rf ~/
"The chancellor approved it"
October 12, 2025 at 4:26 PM
Very cool demonstration of how the @ndif-team.bsky.social Workbench allows for quick iteration on different prompt setups!
How embarrassing for me and confusing to the LLM!

OK, here it is fixed. Nice thing about workbench is that it just takes a second to edit the prompt, and you can see how the LLM responds, now deciding very early it should be ':'
October 11, 2025 at 8:12 PM
Making model internals accessible to domain experts in low-code interfaces will unlock the next step in making interpretability useful across a variety of domains. Very excited about the NDIF Workbench! 💡
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at workbench.ndif.us!
October 10, 2025 at 5:53 PM
I was amazed by how avant-garde this was, but only 30 min into Greg Egan's Permutation City I've already stumbled on digital twins, longevity-crazed billionaires, and widespread B2C rentable compute instances, all from 1994! 🤯 Really prescient!
TIL Ken Liu predicted an eerily familiar setting featuring OpenAI and sama-like characters + US-China race dynamics in his short story "The Perfect Match" from 2012.
October 4, 2025 at 9:19 AM
Reposted by Gabriele Sarti
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
July 14, 2025 at 12:15 PM
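For readers outside mech interp: a typical intervention of the kind the thread refers to edits hidden activations along a learned direction. A hedged sketch with made-up dimensions (not the paper's setup) that makes the implicit linearity assumption visible:

```python
import torch

def steer(hidden, direction, alpha=5.0):
    """Add a unit-norm concept direction to hidden activations.
    Expecting a scaled vector addition to change behaviour is exactly
    the linear-representation assumption the paper examines."""
    direction = direction / direction.norm()
    return hidden + alpha * direction

hidden = torch.randn(1, 12, 768)  # toy hidden states (batch=1, seq_len=12, d_model=768)
concept = torch.randn(768)        # hypothetical learned concept direction
patched = steer(hidden, concept)  # would be written back into the forward pass
```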
What could go wrong when asking Claude to make an Imagine demo within Claude Imagine and using it to play Tic Tac Toe? When notified about the error, the model promptly adds "Sorry about that. Continue playing..." to the interface 😂
October 2, 2025 at 4:15 PM
Reposted by Gabriele Sarti
Really neat, clear explainer on the new "central flows" framework for theoretically modeling learning dynamics
Understanding Optimization in Deep Learning with Central Flows
centralflows.github.io
October 1, 2025 at 12:20 PM
Reposted by Gabriele Sarti
What's the right unit of analysis for understanding LLM internals? We explore this in our mech interp survey (a major update of our 2024 manuscript).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
October 1, 2025 at 2:03 PM
Reposted by Gabriele Sarti
🔍 Are you curious about uncovering the underlying mechanisms and identifying the roles of model components (neurons, …) and abstractions (SAEs, …)?

We provide the first survey of concept description generation and evaluation methods.

Joint effort w/ @lkopf.bsky.social

📄 arxiv.org/abs/2510.01048
October 2, 2025 at 9:13 AM
Now with sleek flyers to test your skills in Italian crossword solving! 🤗 Join our #EVALITA2026 task!
September 23, 2025 at 7:17 AM
It is again the time of year when I beg @aclmeeting.bsky.social execs to rethink the current streaming platform system. For my #EMNLP2025 submissions, I am *required* to upload 2 video recordings + 2 posters + 2 slide decks. Why force both posters and talks for all? Nonsense.
September 15, 2025 at 3:20 PM
Language puzzles from "La Settimana Enigmistica" keep you up at night? Fear not! 🧩 Our new shared task on automatic crossword solving is now live at #EVALITA2026. Be sure to check it out!
🚨 Exciting news from #EVALITA2026 (@ailc-nlp.bsky.social)!
I'm co-organizing Cruciverb-IT, the first shared task on crossword solving 🧩✍️ together with Ciaccio C., @gsarti.com, Dell'Orletta F. and @malvinanissim.bsky.social!
If you love cracking crosswords (or cracking models that do), join us! 🎉
September 15, 2025 at 10:27 AM
Reposted by Gabriele Sarti
When reading AI reasoning text (aka CoT), we (humans) form a narrative about the underlying computation process, which we take as a transparent explanation of model behavior. But what if our narratives are wrong? We measure that and find it usually is.

Now on arXiv: arxiv.org/abs/2508.16599
Humans Perceive Wrong Narratives from AI Reasoning Texts
A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly r...
arxiv.org
August 27, 2025 at 9:30 PM