Ambroise Odonnat
@ambroiseodt.bsky.social
Ph.D. student in Machine Learning at Inria.
Website: https://ambroiseodt.github.io/
Blog: https://logb-research.github.io
🚀 We are happy to organize the BERT²S workshop @neuripsconf.bsky.social 2025 on Recent Advances in Time Series Foundation Models.
🌐 berts-workshop.github.io
📜Submit by August 22
🎓Speakers and panelists: Chenghao Liu, Mingsheng Long, Zoe Piran, Danielle C. Maddix, Ameet Talwalkar, Qingsong Wen
July 22, 2025 at 2:41 PM
🤗Thanks a lot @haeggee.bsky.social and @mjaggi.bsky.social for having me in the MLO group at EPFL @icepfl.bsky.social to present "Large Language Models as Markov Chains".

Slides are available on my website (link in thread).

🎉 New experiments with Llama and Gemma models in the updated paper!
February 28, 2025 at 1:03 PM
Finally, I can't thank Wes and @viviencabannes.bsky.social enough for this collab: you are a rare combination of super-smart and fun to work with!

Hopefully, more to come soon🤠

"Moi, si je devais résumer ma vie aujourd’hui avec vous, je dirais que c’est d’abord des rencontres."
February 4, 2025 at 11:56 AM
On the theoretical side, we show that clustering heads can be learned via gradient descent and provide theoretical insights into the two-phase learning observed in practice.
6/🧵
February 4, 2025 at 11:56 AM
We investigate loss spikes and suggest potential mitigation strategies that could lead to more stable training. We also peek into the transferability of circuits to showcase the usefulness of curriculum learning and data curation.
5/🧵
February 4, 2025 at 11:56 AM
In the second, we unveil "𝑪𝒍𝒖𝒔𝒕𝒆𝒓𝒊𝒏𝒈 𝑯𝒆𝒂𝒅𝒔", circuits that learn the invariance of the task. Their training dynamics unfold in two phases: 1) clustering of the attention embeddings according to the invariance and 2) classifier fitting.
4/🧵
February 4, 2025 at 11:56 AM
In the first paper, we show how GD (gradient descent) reinforces useful circuits in transformers while pruning others to create sub-circuits that help solve complex tasks by breaking them down into intermediate reasoning steps.

3/🧵
February 4, 2025 at 11:56 AM
We consider the 𝒔𝒑𝒂𝒓𝒔𝒆 𝒎𝒐𝒅𝒖𝒍𝒂𝒓 𝒂𝒅𝒅𝒊𝒕𝒊𝒐𝒏 problem where the inputs are sequences of L tokens in the ring of integers modulo p and the corresponding targets are the sum of the first k terms modulo p. Formally, we aim to learn the mapping:
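
In symbols (a sketch written from the description above, with x_1, ..., x_L denoting the input tokens):

$$
f \colon (\mathbb{Z}/p\mathbb{Z})^L \to \mathbb{Z}/p\mathbb{Z},
\qquad
(x_1, \dots, x_L) \mapsto \Big(\sum_{i=1}^{k} x_i\Big) \bmod p .
$$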

2/🧵
February 4, 2025 at 11:56 AM
🚀Proud to share our work on the training dynamics of Transformers with Wassim Bouaziz & @viviencabannes.bsky.social @Inria @MetaAI

📝Easing Optimization Paths arxiv.org/pdf/2501.02362 (accepted @ICASSP 2025 🥳)

📝Clustering Heads 🔥https://arxiv.org/pdf/2410.24050

🖥️ github.com/facebookrese...

1/🧵
February 4, 2025 at 11:56 AM
🎤Presenting our work on Unsupervised Accuracy Estimation at #NeurIPS2024 this week!

✋🏾Poster Session 4 West - on Thu. at 4:30 pm

📍 Poster #4310 - East Exhibit Hall A-C

DM me if you'd like to chat :)
December 10, 2024 at 2:44 PM
🤗This is joint work with Renchunzi Xie, Vasilii Feofanov, Weijian Deng, Jianfeng Zhang, and Bo An.

Finally, I want to thank @ramealexandre.bsky.social and Youssef Attia El Hili for fruitful discussions during the elaboration of this work.

🧵/🧵
December 3, 2024 at 4:58 PM
🥳Finally the awaited surprise!
Our work includes a result akin to that of
@petar-v.bsky.social in “softmax is not enough (for sharp out-of-distribution)” (arxiv.org/pdf/2410.01104). We discuss its implications in the context of unsupervised accuracy estimation.

12/🧵
December 3, 2024 at 4:58 PM
Last but not least, we discuss in great detail the limitations of our approach and how to formalize prediction bias in unsupervised settings. We believe this is a missing piece in the current literature and hope our work can be a first step toward bridging this gap.

11/🧵
December 3, 2024 at 4:58 PM
We also qualitatively demonstrate the superiority of our approach.

10/🧵
December 3, 2024 at 4:58 PM
We obtain SOTA performance for various shifts (subpopulation, synthetic, natural) and architectures (ResNet, ConvNext, and Vision Transformers).

9/🧵
December 3, 2024 at 4:58 PM
Thus, we truncate the exponential when the model is not calibrated. Since we cannot access test labels, we provide a criterion that automatically selects the proper normalization, softmax or Taylor. This boils down to a simple three-step recipe:
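
For intuition, here is a minimal PyTorch sketch of how such a three-step recipe could look. The calibration proxy, the threshold, and the function names are illustrative placeholders, not the exact implementation from the paper (see the linked repo for that).

```python
import torch
import torch.nn.functional as F


def normalize_logits(logits: torch.Tensor, threshold: float = 5.0) -> torch.Tensor:
    """Steps 1-2: pick a normalization (softmax vs. Taylor) from an unsupervised
    calibration proxy, then normalize. The proxy and threshold are illustrative."""
    probs = F.softmax(logits, dim=-1)
    # Illustrative proxy: how far the predictions are from the uniform distribution.
    uniform = torch.full_like(probs, 1.0 / probs.shape[-1])
    divergence = F.kl_div(uniform.log(), probs, reduction="batchmean")
    if divergence > threshold:
        # Overconfident/miscalibrated regime: truncate the exponential
        # with its second-order Taylor expansion (always positive).
        scores = 1.0 + logits + 0.5 * logits.pow(2)
    else:
        # Calibrated enough: keep the softmax.
        scores = probs
    # Row-normalize so each sample's scores sum to one.
    return scores / scores.sum(dim=-1, keepdim=True)


def mano_like_score(logits: torch.Tensor, p: int = 4) -> torch.Tensor:
    """Step 3: aggregate the normalized logits into one scalar in (0, 1]
    via a rescaled entrywise L_p norm of the (n x K) score matrix."""
    n, k = logits.shape
    return normalize_logits(logits).norm(p=p) / (n * k) ** (1.0 / p)
```

In practice, you would stack the test-time logits of your unlabeled set into an n × K matrix and correlate the resulting scalar with downstream accuracy.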

8/🧵
December 3, 2024 at 4:58 PM
Here’s where it gets tricky! How do you normalize the logits? Simply using the softmax is a bad idea, as it is overconfident (see arxiv.org/pdf/2310.14814 and arxiv.org/pdf/2205.09310). We even show that it accumulates prediction bias in miscalibrated scenarios.

7/🧵
December 3, 2024 at 4:58 PM
We demonstrate that 𝐌𝐚𝐍𝐨 captures the model’s uncertainty, which makes theory and practice perfectly balanced, as all things should be!

6/🧵
December 3, 2024 at 4:58 PM
As logits have different ranges, they must be normalized before computing the margins. Once that's done, you aggregate the margins following the low-density assumption, and that's it (well, almost…).

5/🧵
December 3, 2024 at 4:58 PM
By reviving the (old) low-density separation assumption (see Olivier Chapelle's book on semi-supervised learning), we demonstrate that logits reflect the distance to decision boundaries and thus correlate with test performance.
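
For intuition, here is the standard fact for a linear classifier (an illustration, not our exact setting): the distance from an input x to the decision boundary between classes i and j is proportional to the corresponding logit margin,

$$
\operatorname{dist}\big(x, \{f_i = f_j\}\big)
= \frac{\lvert f_i(x) - f_j(x) \rvert}{\lVert w_i - w_j \rVert_2},
\qquad f_c(x) = w_c^\top x + b_c ,
$$

so large margins place the sample far from the boundary, in a low-density region, which is exactly what the low-density separation assumption exploits.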

4/🧵
December 3, 2024 at 4:58 PM
Most methods use the model’s logits, without proper justification, and rely on the softmax despite its overconfidence issues. This motivates us to ask:

3/🧵
December 3, 2024 at 4:58 PM
Unsupervised Accuracy Estimation is challenging as:

1) No access to pre-training data,
2) Unlabeled test data,
3) Distribution shift between training and test data.

2/🧵
December 3, 2024 at 4:58 PM
🚨So, you want to predict your model's performance at test time?🚨

💡Our NeurIPS 2024 paper proposes 𝐌𝐚𝐍𝐨, a training-free and SOTA approach!

📑 arxiv.org/pdf/2405.18979
🖥️ https://github.com/Renchunzi-Xie/MaNo

1/🧵(A surprise at the end!)
December 3, 2024 at 4:58 PM