Andreas Waldis
tresiwald.bsky.social
Behavioral and Internal Interpretability 🔎
Incoming PostDoc Tübingen University | PhD Student at @ukplab.bsky.social, TU Darmstadt/Hochschule Luzern
LMs that "know more" about toxicity are less toxic!
Our #TACL 📄 connects behavior and internals:
💠 LMs amplify toxicity beyond humans
💠 Information about toxicity peaks in lower layers
💠 Bypassing these layers increases toxicity
More details👇 #NLProc #interpretability (1/🧵)
January 27, 2026 at 1:01 PM
Reposted by Andreas Waldis
✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social, John Hewitt, @amuuueller.bsky.social, and @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion

Join us in Montréal! @colmweb.org
October 9, 2025 at 5:30 PM
Reposted by Andreas Waldis
Missed a spot? If you have a pre-reviewed paper from ARR or COLM that focuses on the INTERPLAY between LM internals and behavior, there is a shortcut to presenting at our @colmweb.org workshop! ✨
Join us in Montréal! 🇨🇦

CfP: shorturl.at/sBomu
OpenReview: shorturl.at/WwWhg

#nlproc #interpretability
July 8, 2025 at 9:06 AM
Reposted by Andreas Waldis
Delighted that ✨Mor Geva (@megamor2.bsky.social) and ✨Anna Ivanova (@neuranna.bsky.social) will complete our speaker line-up and talk about the INTERPLAY of model internals and behavior.

Be there and submit by June 30th 📄
shorturl.at/sBomu

See you in 🇨🇦 @colmweb.org
#nlproc #interpretability
June 24, 2025 at 1:06 PM