Sebastian Bordt
@sbordt.bsky.social
Language models and interpretable machine learning. Postdoc @ Uni Tübingen.

https://sbordt.github.io/
During the last couple of years, we have read a lot of papers on explainability and often felt that something was fundamentally missing🤔

This led us to write a position paper (accepted at #ICML2025) that attempts to identify the problem and to propose a solution.

arxiv.org/abs/2402.02870
👇🧵
July 10, 2025 at 5:58 PM
🔧What are the reasons for the forgetting? We highlight one important factor: the weight decay parameter of AdamW. Concretely, we show that data contamination is forgotten at least as fast as past gradients decay in AdamW.
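A minimal sketch of the intuition (my illustration, not the paper's code): AdamW's decoupled weight decay multiplies the weights, and hence the contribution of any earlier update, by (1 - lr * weight_decay) at every subsequent step. The learning rate and weight decay below are made-up illustrative values.

```python
# Illustrative sketch: how the contribution of a single past update shrinks
# under AdamW's decoupled weight decay. Hyperparameters are assumptions.
lr = 1e-3           # learning rate (eta)
weight_decay = 0.1  # AdamW weight decay (lambda)

contribution = 1.0  # relative size of an update applied at some earlier step
for _ in range(10_000):  # steps of further training
    contribution *= 1.0 - lr * weight_decay  # decoupled decay factor per step

print(f"remaining fraction after 10k steps: {contribution:.3f}")  # ~0.368
```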
July 8, 2025 at 6:45 AM
🧠 The mechanism? Forgetting dynamics! As the model continues training, texts that were seen earlier become less and less important. As we illustrate with OLMo-7B, this effect persists in fairly large models.
July 8, 2025 at 6:45 AM
🚀But modern LLMs are not Chinchilla-optimal: they are trained on significantly more tokens. And as the overall size of the training data increases, the impact of contamination starts to decrease.
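For a back-of-the-envelope sense of scale (illustrative numbers, not results from the paper):

```python
# Rough comparison of Chinchilla-optimal vs. modern token budgets (assumed numbers).
params = 7e9                     # a 7B-parameter model
chinchilla_tokens = 20 * params  # ~20 tokens per parameter -> 1.4e11 tokens
actual_tokens = 2e12             # e.g., a modern run on ~2T tokens

print(f"Chinchilla-optimal budget: {chinchilla_tokens:.1e} tokens")
print(f"Multiple of Chinchilla:    {actual_tokens / chinchilla_tokens:.1f}x")  # ~14x
```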
July 8, 2025 at 6:44 AM
Have you ever wondered whether a few instances of data contamination really lead to benchmark overfitting?🤔 Then our latest #ICML paper on the effect of data contamination on LLM evals might be for you!🚀

Paper: arxiv.org/abs/2410.03249
👇🧵
July 8, 2025 at 6:42 AM
I really like the new HTML preview on arXiv, but it somehow handles LaTeX errors differently than the PDF. I've been seeing lots of error messages in ICML papers lately.
March 13, 2025 at 11:00 AM
can you draw me a dragon in tikz
February 28, 2025 at 10:50 AM
The chain of thought in DeepSeek-R1 is pretty impressive.
January 20, 2025 at 9:35 PM
What is the reason for the forgetting?

The phenomenon is complex and requires more investigation.

However, in large-scale training runs, weight decay plays an important role.

This leads to a fun little theory of example forgetting via cumulative weight decay - check the paper for details! :)
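In rough outline (my paraphrase, not the paper's exact statement), the argument unrolls AdamW's decoupled weight decay, where η_t is the learning rate, λ the weight decay, and u_t the Adam update at step t:

```latex
% Decoupled weight decay recursion and its unrolled form (paraphrase).
\theta_{t+1} = (1 - \eta_t \lambda)\,\theta_t - \eta_t u_t
\quad\Longrightarrow\quad
\theta_T = \Bigg(\prod_{s=0}^{T-1}(1-\eta_s\lambda)\Bigg)\theta_0
         \;-\; \sum_{t=0}^{T-1}\eta_t\Bigg(\prod_{s=t+1}^{T-1}(1-\eta_s\lambda)\Bigg)u_t
% Updates from early steps are damped by many factors of (1 - \eta\lambda):
% this cumulative decay is the sense in which old examples are "forgotten".
```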
December 14, 2024 at 8:12 PM
The results for OLMo-7B are still preliminary and not yet in the pre-print. But you can find them on the poster!
December 14, 2024 at 8:09 PM
We then scale up our experiments by contaminating intermediate checkpoints of OLMo-1B and OLMo-7B.

Immediately after contamination, this leads to strong benchmark overfitting.

Surprisingly, as we continue training, almost all of the contamination is forgotten!
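In pseudocode, the setup looks roughly like this (my sketch with hypothetical helper names, not the paper's code):

```python
# Sketch of the contamination experiment: inject benchmark examples into the
# pre-training stream at an intermediate checkpoint, keep training, and track
# benchmark accuracy over time. All helper functions are hypothetical.

def contamination_experiment(checkpoint, pretraining_stream, benchmark,
                             n_repetitions=4, eval_every=1_000):
    model = load_checkpoint(checkpoint)  # e.g., an intermediate OLMo checkpoint
    stream = mix_into_stream(pretraining_stream,
                             benchmark.examples * n_repetitions)
    history = []
    for step, batch in enumerate(stream):
        train_step(model, batch)         # continue ordinary pre-training
        if step % eval_every == 0:
            history.append((step, evaluate(model, benchmark)))
    return history  # accuracy spikes right after contamination, then decays
```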
December 14, 2024 at 8:07 PM
At the same time, even 32x repeated contamination can be forgotten if the training data is scaled beyond 5 times the Chinchilla-optimal amount - the regime of many modern LLMs.
December 14, 2024 at 8:03 PM
By training small models from scratch, we find that the effect of contamination strongly depends on the scale of the data.

If the amount of data follows the Chinchilla scaling law (roughly 20 tokens per model parameter), minor contamination leads to overfitting.
December 14, 2024 at 8:00 PM
Are you interested in data contamination and LLM benchmarks?🤖

Check out our poster today at the NeurIPS ATTRIB workshop (3-4:30pm)!

💡 TL;DR: In the large-data regime, a few instances of data contamination matter less than you might think.
December 14, 2024 at 7:53 PM