Yonatan Belinkov ✈️ COLM2025
boknilev.bsky.social
Assistant professor of computer science at Technion; visiting scholar at @KempnerInst 2025-2026
https://belinkov.com/
Reposted by Yonatan Belinkov ✈️ COLM2025
🤔What happens when LLM agents must choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid harming humans?

🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
October 8, 2025 at 3:14 PM
Traveling to #COLM2025 this week, and here's some work from our group and collaborators:
Cognitive biases, hidden knowledge, CoT faithfulness, model editing, and LM4Science
See the thread for details and reach out if you'd like to discuss more!
October 7, 2025 at 1:41 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
October 1, 2025 at 2:03 PM
Opportunities to join my group in fall 2026:
* PhD applications direct or via ELLIS @ellis.eu (ellis.eu/news/ellis-p...)
* Post-doc applications direct or via Azrieli (azrielifoundation.org/fellows/inte...) or Zuckerman (zuckermanstem.org/ourprograms/...)
October 1, 2025 at 1:44 PM
Excited to join @KempnerInst this year!
Get in touch if you're in the Boston area and want to chat about anything related to AI interpretability, robustness, interventions, safety, multi-modality, protein/DNA LMs, new architectures, multi-agent communication, or anything else you're excited about!
September 22, 2025 at 6:41 PM
@robinjia.bsky.social
speaking at
@kempnerinstitute.bsky.social
on Auditing, Dissecting, and Evaluating LLMs
September 18, 2025 at 5:22 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
Thrilled that FUR was accepted to @emnlpmeeting.bsky.social Main🎉

In case you can't wait that long to hear about it in person, it will also be presented as an oral at @interplay-workshop.bsky.social @colmweb.org 🥳

FUR is a parametric test assessing whether CoTs faithfully verbalize latent reasoning.
August 21, 2025 at 3:21 PM
BlackboxNLP is the workshop on interpreting and analyzing NLP models (including LLMs, VLMs, etc). We accept full (archival) papers and extended abstracts.

The workshop is highly attended and offers great exposure for your finished work, as well as feedback on work in progress.

#emnlp2025 at Suzhou, China!
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.

📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
August 12, 2025 at 7:16 PM
Join our Discord for discussions and a bunch of simple submission ideas you can try!
discord.gg/n5uwjQcxPR

Participants will have the option to write a system description paper that gets published.
July 13, 2025 at 5:44 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
Have you started working on your submission for the MIB shared task yet? Tell us what you’re exploring!

New featurization methods?
Circuit pruning?
Better feature attribution?

We'd love to hear about it 👇
July 9, 2025 at 7:15 AM
Reposted by Yonatan Belinkov ✈️ COLM2025
Working on feature attribution, circuit discovery, feature alignment, or sparse coding?
Consider submitting your work to the MIB Shared Task, part of this year’s #BlackboxNLP

We welcome submissions of both existing methods and new or experimental POCs!
July 8, 2025 at 9:35 AM
Reposted by Yonatan Belinkov ✈️ COLM2025
VLMs perform better on questions about text than when answering the same questions about images. But why, and how can we fix it?

In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
June 26, 2025 at 10:41 AM
Reposted by Yonatan Belinkov ✈️ COLM2025
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 4:06 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
🚨New paper at #ACL2025 Findings!
REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space.
LMs memorize and leak sensitive data from their training: emails, SSNs, URLs.
We propose a surgical method to unlearn it.
🧵👇w/ @boknilev.bsky.social @mtutek.bsky.social
1/8
May 27, 2025 at 8:19 AM
Interested in mechanistic interpretability and care about evaluation? Please consider submitting to our shared task at #blackboxNLP this year!
BlackboxNLP, the leading workshop on interpretability and analysis of language models, will be co-located with EMNLP 2025 in Suzhou this November! 📆

This edition will feature a new shared task on circuits/causal variable localization in LMs, details here: blackboxnlp.github.io/2025/task
May 15, 2025 at 9:57 AM
Reposted by Yonatan Belinkov ✈️ COLM2025
Slides available here: docs.google.com/presentation...
May 4, 2025 at 6:02 PM
Excited about the release of MIB, a Mechanistic Interpretability Benchmark!

Come talk to us at #iclr2025 and consider submitting to the leaderboard.

We’re also planning a shared task around it at #blackboxNLP this year, co-located with #emnlp2025
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 24, 2025 at 2:20 AM
Reposted by Yonatan Belinkov ✈️ COLM2025
We release many public resources, including:

🌐 Website: mib-bench.github.io
📄 Data: huggingface.co/collections/...
💻 Code: github.com/aaronmueller...
📊 Leaderboard: Coming very soon!
April 23, 2025 at 6:15 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
March 6, 2025 at 10:15 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
If erasing information from CoT steps adversely affects the model's prediction, this indicates that such explanations are parametrically faithful.
February 21, 2025 at 12:43 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
It has been amazing to work with @fatemehc.bsky.social, @anamarasovic.bsky.social and Yonatan Belinkov on this incredibly important topic.

I look forward to further work along the parametric faithfulness route!

Codebase (& data): github.com/technion-cs-...
February 21, 2025 at 12:43 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Yonatan Belinkov ✈️ COLM2025
Sentences are partially understood before they're fully read. How do LMs incrementally interpret their inputs?

In a new paper, @amuuueller.bsky.social and I use mech interp tools to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10
December 19, 2024 at 1:40 PM