Alessio Devoto
@alessiodevoto.bsky.social
PhD in ML/AI | Researching Efficient ML/AI (vision & language) 🍀 & Interpretability | @SapienzaRoma @EdinburghNLP | https://alessiodevoto.github.io/ | ex @NVIDIA
Pinned
In Vision & Audio transformers, not all tokens need the same compute resources! We propose “modular learners” to control compute at token-level granularity (MHA & MLP): hard tokens get more, easy ones get less!
w/ @sscardapane.bsky.social @neuralnoise.com @bartoszWojcik
Soon #AAAI25
Link 👇
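The core idea, as a minimal PyTorch sketch (hypothetical names and routing rule, not the exact architecture from the paper): a lightweight router scores each token, and only "hard" tokens take the expensive path.

```python
# Minimal sketch of token-level conditional compute.
# Hypothetical router/threshold; the paper's modular learners
# apply this at both MHA and MLP granularity.
import torch
import torch.nn as nn

class TokenRoutedBlock(nn.Module):
    def __init__(self, dim: int, threshold: float = 0.5):
        super().__init__()
        self.router = nn.Linear(dim, 1)  # scores per-token "difficulty"
        self.mlp = nn.Sequential(        # the expensive path
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        scores = torch.sigmoid(self.router(x))    # (batch, seq, 1)
        mask = (scores > self.threshold).float()  # 1 = hard token
        # Hard tokens get the MLP; easy ones keep only the residual.
        # (A real implementation would skip the compute, not mask it.)
        return x + mask * self.mlp(x)

x = torch.randn(2, 16, 64)
print(TokenRoutedBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```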
Reposted by Alessio Devoto
💡 We compare prompting (zero and multi-shot + explanations) and inference-time interventions (ActAdd, REFT and SAEs).

Following SpARE (@yuzhaouoe.bsky.social @alessiodevoto.bsky.social), we propose ✨ contrastive SAE steering ✨ with mutual info to personalize literary MT by tuning latent features 4/
May 23, 2025 at 12:23 PM
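For readers new to SAE steering, the basic move looks roughly like this (a hedged sketch with made-up dimensions and a random SAE; the paper selects features contrastively via mutual information):

```python
# Sketch of steering with a sparse autoencoder (SAE): boost one
# latent feature, then decode back into the residual stream.
# Random weights for illustration only.
import torch

d_model, d_sae = 512, 4096
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5

def steer(resid: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    # resid: (seq, d_model) residual-stream activations
    latents = torch.relu(resid @ W_enc)   # sparse feature activations
    latents[:, feature_idx] += alpha      # nudge the chosen feature
    return latents @ W_dec                # decode back to the stream

out = steer(torch.randn(8, d_model), feature_idx=123, alpha=5.0)
print(out.shape)  # torch.Size([8, 512])
```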
Reposted by Alessio Devoto
MMLU-Redux just touched down at #NAACL2025! 🎉
Wish I could be there for our "Are We Done with MMLU?" poster today (9:00-10:30am in Hall 3, Poster Session 7), but visa drama said nope 😅
If anyone's swinging by, give our research some love! Hit me up if you check it out! 👋
May 2, 2025 at 1:00 PM
Reposted by Alessio Devoto
My amazing collaborators will present several works at ICLR and NAACL later this month -- please catch up with them if you're attending! I tried to summarise our recent work in a blog post: neuralnoise.com/2025/march-r...
April 19, 2025 at 8:15 AM
Reposted by Alessio Devoto
Please share it within your circles! edin.ac/3DDQK1o
March 13, 2025 at 11:59 AM
Reposted by Alessio Devoto
🚀 New Paper Alert! 🚀

We introduce Q-Filters, a training-free method for efficient KV Cache compression!

It is compatible with FlashAttention and can compress during generation, which is particularly useful for reasoning models ⚡

TLDR: we make Streaming-LLM smarter using the geometry of attention
March 6, 2025 at 4:02 PM
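For intuition, here is a loose sketch of projection-based KV cache pruning (hypothetical scoring; the actual Q-Filters are derived offline from the geometry of query representations, and the details differ):

```python
# Rough sketch in the spirit of Q-Filters: score cached keys by their
# projection onto a fixed filter direction and evict the rest.
import torch
import torch.nn.functional as F

def prune_kv(keys, values, q_filter, keep: int):
    # keys, values: (seq, d_head); q_filter: (d_head,) unit vector
    scores = keys @ q_filter                       # one score per key
    idx = scores.topk(keep).indices.sort().values  # keep top-k, in order
    return keys[idx], values[idx]

seq, d = 128, 64
k, v = torch.randn(seq, d), torch.randn(seq, d)
f = F.normalize(torch.randn(d), dim=0)
k2, v2 = prune_kv(k, v, f, keep=32)
print(k2.shape, v2.shape)  # torch.Size([32, 64]) torch.Size([32, 64])
```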
Reposted by Alessio Devoto
Live from the CoLoRAI workshop at AAAI
(april-tools.github.io/colorai/)

Nadav Cohen is now giving his talk on "What Makes Data Suitable for Deep Learning?"

Tools from quantum physics are shown to be useful in building more expressive deep learning models by changing the data distribution.
March 4, 2025 at 2:54 PM
Reposted by Alessio Devoto
2018: Saliency maps give plausible interpretations of random weights, triggering skepticism and catalyzing the mechinterp cultural movement, which now advocates for SAEs.

2025: SAEs give plausible interpretations of random weights, triggering skepticism and ...
March 3, 2025 at 6:42 PM
Reposted by Alessio Devoto
Graphical tensor notation for interpretability

www.lesswrong.com/posts/BQKKQi...
February 23, 2025 at 9:16 PM
Reposted by Alessio Devoto
Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels.

sakana.ai/ai-cuda-engi...

The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch.

Examples:
February 20, 2025 at 1:50 AM
Reposted by Alessio Devoto
It's 2025, and I’ve finally updated my Python setup guide to use uv + venv instead of conda + pip!
Here's my go-to recommendation for uv + venv in Python projects, for faster installs and better dependency management: github.com/rasbt/LLMs-f...
(Any additional suggestions?)
February 15, 2025 at 7:14 PM
Cool research on how models memorize data 📝: The 'Manifold Memorization Hypothesis' by Brendan Ross, Hamidreza Kamkari et al. suggests memorization occurs when the model's learned manifold matches the true data manifold but with too small a 'local intrinsic dimensionality'.
February 5, 2025 at 4:11 PM
Massive activations & weights in LLMs, two cool works 🤓:

- The Super Weight: finds that performance can be totally degraded by pruning a *single* weight - Mengxia Yu et al.

- Massive Activations in LLMs: finds that some (crucial) activations have very high norms irrespective of context - Mingjie Sun et al.
February 4, 2025 at 5:55 PM
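Both effects are easy to probe; a minimal sketch with a toy layer and forward hooks (the papers analyze specific LLMs, layers, and tokens):

```python
# Quick probe for outsized activations: hook a transformer layer and
# report the largest hidden-state magnitude per token. Toy setup only.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
captured = {}

def hook(module, inputs, output):
    captured["acts"] = output.detach()

layer.register_forward_hook(hook)
layer(torch.randn(1, 32, 64))

acts = captured["acts"][0]             # (seq, d_model)
per_token, _ = acts.abs().max(dim=-1)  # largest magnitude per token
print("top |activation| per token:", per_token.topk(3).values)
```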
Reposted by Alessio Devoto
On the last day before the Spring Festival holiday in China, DeepSeek released a NEW work on @hf.co 🤯

Janus-Pro 🔥 an autoregressive framework that unifies multimodal understanding and generation
huggingface.co/deepseek-ai/...
✨ 1B / 7B
✨ MIT License
deepseek-ai/Janus-Pro-1B · Hugging Face
January 27, 2025 at 5:36 PM
Reposted by Alessio Devoto
Not only that, but much of the science community here is already stronger and larger than it was on X.

On Twitter, my feed of scientists who study climate-related topics topped out at 3300. Here, we’re at 4500 already and it’s still growing.

Pin here: bsky.app/profile/did:...
January 24, 2025 at 8:26 PM
Reposted by Alessio Devoto
*MoE Graph Transformers for Interpretable Particle Collision Detection*
by @alessiodevoto.bsky.social @sgiagu.bsky.social et al.

We propose a MoE graph transformer for particle collision analysis, with many nice interpretability insights (e.g., expert specialization).

arxiv.org/abs/2501.03432
January 10, 2025 at 2:12 PM
Reposted by Alessio Devoto
DeepSeek GGUF just dropped; the smallest version needs 207GB of disk / 40GB of RAM: huggingface.co/collections/...
Deepseek V3 (All Versions) - a unsloth Collection
Deepseek V3 - available in bf16, original, and GGUF formats, with support for 2, 3, 4, 5, 6 and 8-bit quantized versions.
January 8, 2025 at 12:20 AM
LLMs inner representations🔬
Llamas Work in English: LLMs default to English-based concept representations, regardless of input language (@wendlerc.bsky.social et al.)
Semantic Hub: Multimodal models create a single shared semantic space, structured by their primary language (@zhaofengwu.bsky.social et al.)
December 22, 2024 at 5:57 PM
Reposted by Alessio Devoto
I'll get straight to the point.

We trained 2 new models. Like BERT, but modern. ModernBERT.

Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff.

It's much faster, more accurate, longer context, and more useful. 🧵
December 19, 2024 at 4:45 PM
In Vision & Audio transformers, not all tokens need the same compute resources! We propose “modular learners” to control compute at token-level granularity (MHA & MLP): hard tokens get more, easy ones get less!
w/ @sscardapane.bsky.social @neuralnoise.com @bartoszWojcik
Soon #AAAI25
Link 👇
December 19, 2024 at 4:33 PM
Reposted by Alessio Devoto
Josh Tenenbaum on scaling up vs growing up and the path to human-like reasoning #NeurIPS2024
December 15, 2024 at 6:14 PM
Reposted by Alessio Devoto
*Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference*
with @alessiodevoto.bsky.social @neuralnoise.com

Happy to share that our work on distilling efficient transformers with dynamic module activation was accepted at #AAAI2025. 🔥

arxiv.org/abs/2312.10193
December 11, 2024 at 2:25 PM
Reposted by Alessio Devoto
*Sparse Crosscoders for Cross-Layer Features and Model Diffing*
by @colah.bsky.social @anthropic.com

Investigates the stability & dynamics of "interpretable features" with cross-layer SAEs. Can also be used to investigate differences between fine-tuned models.

transformer-circuits.pub/2024/crossco...
December 6, 2024 at 2:02 PM
Very cool work! 👏🚀 Unfortunately, errors in the original dataset will propagate to all new languages 😕

We investigated errors in the original MMLU in
arxiv.org/abs/2406.04127

@aryopg.bsky.social @neuralnoise.com
Is MMLU Western-centric? 🤔

As part of a massive cross-institutional collaboration:
🗽 Finds MMLU is heavily overfit to Western culture
🔍 Professionally annotates data for cultural sensitivity
🌍 Releases improved Global-MMLU in 42 languages

📜 Paper: arxiv.org/pdf/2412.03304
📂 Data: hf.co/datasets/Coh...
December 6, 2024 at 1:57 PM