https://belinkov.com/
🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs 🚀🧵
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
* PhD applications direct or via ELLIS @ellis.eu (ellis.eu/news/ellis-p...)
* Post-doc applications direct or via Azrieli (azrielifoundation.org/fellows/inte...) or Zuckerman (zuckermanstem.org/ourprograms/...)
Get in touch if you're in the Boston area and want to chat about anything related to AI interpretability, robustness, interventions, safety, multi-modality, protein/DNA LMs, new architectures, multi-agent communication, or anything else you're excited about!
We’re thrilled to welcome Yonatan Belinkov (expert in #NLP) and Daphna Weinshall (expert in human & machine vision) as visiting scholars for the 2025–26 academic year.
📖 Read more: bit.ly/47QkDID
#AI #MachineVision @boknilev.bsky.social
Speaking at @kempnerinstitute.bsky.social on Auditing, Dissecting, and Evaluating LLMs
In case you can’t wait that long to hear about it in person, it will also be presented as an oral at @interplay-workshop.bsky.social @colmweb.org 🥳
FUR is a parametric test assessing whether CoTs faithfully verbalize latent reasoning.
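To make the idea concrete, here is a minimal sketch of a FUR-style check, under loud assumptions (gpt2 as a stand-in model, a toy arithmetic step, naive gradient ascent as the unlearning procedure; this is not the paper's implementation): erase the verbalized reasoning step from the parameters, then see whether the final answer moves. If unlearning the step flips the answer, the step was genuinely load-bearing, i.e., the CoT was faithful.

```python
# Hedged sketch of a FUR-style parametric faithfulness check, not the paper's code.
# Assumptions: gpt2 as a stand-in model, a toy arithmetic prompt, and naive
# gradient ascent as the "unlearning" step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

prompt = "Q: What is 17 + 25? A:"
step = "17 + 25 = 42"  # the verbalized reasoning step to unlearn

before = answer(prompt)

# "Unlearn" the step: a few gradient-ascent updates on its LM loss.
opt = torch.optim.SGD(model.parameters(), lr=5e-4)
step_ids = tok(step, return_tensors="pt").input_ids
for _ in range(10):
    loss = model(step_ids, labels=step_ids).loss
    opt.zero_grad()
    (-loss).backward()  # ascend: make the verbalized step less likely
    opt.step()

after = answer(prompt)
print(f"before: {before!r}, after: {after!r}, changed: {before != after}")
```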
The workshop is well attended and offers great exposure for your finished work, or feedback on work in progress.
#emnlp2025 in Suzhou, China!
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.
📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
discord.gg/n5uwjQcxPR
Participants will have the option to write a system description paper that gets published.
New featurization methods?
Circuit pruning?
Better feature attribution?
We'd love to hear about it 👇
Consider submitting your work to the MIB Shared Task, part of this year’s #BlackboxNLP
We welcome submissions of both existing methods and new or experimental POCs!
In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
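For concreteness, a minimal sketch of what steering with an SAE feature looks like, under loud assumptions (gpt2, a random placeholder decoder matrix `W_dec`, an arbitrary feature index; a real run would load a trained SAE for the chosen layer): add the selected feature's decoder direction, scaled by a coefficient, to that layer's residual stream during generation. The paper's point is that which feature you select matters far more than the mechanics below.

```python
# Minimal sketch of SAE-feature steering, not the paper's code. Assumptions:
# gpt2, a random placeholder decoder matrix W_dec, and an arbitrary feature
# index; a real run would load a trained SAE for this layer and pick the
# feature with a selection criterion, which is the paper's actual focus.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer, alpha, feat = 6, 8.0, 123
W_dec = torch.randn(16384, model.config.n_embd)  # placeholder for a trained SAE decoder
direction = W_dec[feat] / W_dec[feat].norm()     # the selected feature's direction

def steer(module, inputs, output):
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * direction                  # push the residual stream along the feature
    return (hs,) + tuple(output[1:]) if isinstance(output, tuple) else hs

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The movie was", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```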
REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space.
LMs memorize and leak sensitive data—emails, SSNs, URLs—from their training data.
We propose a surgical method to unlearn it.
🧵👇w/ @boknilev.bsky.social @mtutek.bsky.social
1/8
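Here is a toy illustration of the vocabulary-space intuition, with loud assumptions (gpt2, a synthetic hidden vector, " Paris" standing in for a sensitive token; REVS itself edits selected neurons, which this sketch does not do): read a hidden vector as token logits via the unembedding, then remove its component along the sensitive token's direction so that token's rank drops.

```python
# Toy sketch of the vocabulary-space view behind REVS, not the paper's algorithm.
# Assumptions: gpt2's unembedding, a synthetic hidden vector, " Paris" as a
# stand-in for a sensitive token. REVS edits selected neurons; this only shows
# the project-and-demote idea on one vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
W_U = model.lm_head.weight.detach()  # (vocab, hidden) unembedding matrix

target = tok.encode(" Paris")[0]     # stand-in "sensitive" token
# A hidden vector that strongly promotes the target token.
h = W_U[target] + 0.05 * torch.randn(model.config.n_embd)

def rank_of(vec: torch.Tensor, tok_id: int) -> int:
    logits = W_U @ vec                            # read the vector in vocabulary space
    return int((logits > logits[tok_id]).sum())   # 0 = top-ranked

u = W_U[target] / W_U[target].norm()  # the sensitive token's direction
h_edited = h - (h @ u) * u            # project out that component

print("rank before:", rank_of(h, target))         # ~0: token is promoted
print("rank after: ", rank_of(h_edited, target))  # demoted: much larger rank index
```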
This edition will feature a new shared task on circuits/causal variable localization in LMs, details here: blackboxnlp.github.io/2025/task
Come talk to us at #iclr2025 and consider submitting to the leaderboard.
We’re also planning a shared task around it at #blackboxNLP this year, co-located with #emnlp2025
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
🌐 Website: mib-bench.github.io
📄 Data: huggingface.co/collections/...
💻 Code: github.com/aaronmueller...
📊 Leaderboard: Coming very soon!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
I look forward to further work on the parametric faithfulness route!
Codebase (& data): github.com/technion-cs-...
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
In a new paper, @amuuueller.bsky.social and I use mech interp tools to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10