Dana Arad
@danaarad.bsky.social
NLP Researcher | CS PhD Candidate @ Technion
Pinned
Now accepted to EMNLP Main Conference!
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
August 20, 2025 at 7:38 PM
Submit your work to #BlackboxNLP 2025!
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.

📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
August 12, 2025 at 7:13 PM
Excited to spend the rest of the summer visiting @davidbau.bsky.social's lab at Northeastern! If you’re in the area and want to chat about interpretability, let me know ☕️
August 10, 2025 at 1:56 PM
Reposted by Dana Arad
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!

Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
July 27, 2025 at 6:11 AM
10 days to go! Still time to run your method and submit!
Just 10 days to go until the results submission deadline for the MIB Shared Task at #BlackboxNLP!

If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!

Join us on Discord: discord.gg/n5uwjQcxPR
July 23, 2025 at 8:21 AM
Three weeks is plenty of time to submit your method!
⏳ Three weeks left! Submit your work to the MIB Shared Task at #BlackboxNLP, co-located with @emnlpmeeting.bsky.social

Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
July 13, 2025 at 6:11 AM
What are you working on for the MIB shared task?

Check out the full task description here: blackboxnlp.github.io/2025/task/
July 9, 2025 at 7:21 AM
Reposted by Dana Arad
New to mechanistic interpretability?
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation

Join the Discord server for ideas and discussions: discord.gg/n5uwjQcxPR
July 7, 2025 at 8:42 AM
VLMs answer questions about text better than the same questions about images - but why? And how can we fix it?

In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
June 26, 2025 at 10:41 AM
Reposted by Dana Arad
Working on circuit discovery in LMs?
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!

The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
June 24, 2025 at 2:24 PM
Reposted by Dana Arad
Have you heard about this year's shared task? 📢

Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
June 23, 2025 at 2:46 PM
Reposted by Dana Arad
SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 5:07 PM
Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 4:06 PM
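For readers new to the setup, here is a minimal sketch of what SAE-based steering typically looks like: adding a chosen feature's decoder direction to the residual stream. The layer, feature index, dimensions, and scale below are illustrative assumptions; the paper's key contribution, the feature-selection criterion, is not reproduced here.

```python
# A minimal sketch of SAE-based steering (not the paper's exact method):
# nudge the residual stream along one SAE feature's decoder direction.
import torch

def steer_hidden(hidden: torch.Tensor,
                 decoder: torch.Tensor,
                 feature_idx: int,
                 alpha: float) -> torch.Tensor:
    """Shift hidden states along one SAE feature's decoder direction.

    hidden:      (batch, seq, d_model) residual-stream activations
    decoder:     (n_features, d_model) SAE decoder weight matrix
    feature_idx: index of the (pre-selected) feature to steer with
    alpha:       steering strength
    """
    direction = decoder[feature_idx]
    direction = direction / direction.norm()  # unit-norm steering vector
    return hidden + alpha * direction         # broadcasts over batch, seq

# Toy usage with random tensors standing in for real activations / SAE weights.
hidden = torch.randn(1, 8, 768)     # e.g. a GPT-2-sized residual stream
decoder = torch.randn(16384, 768)   # e.g. a 16k-feature SAE
steered = steer_hidden(hidden, decoder, feature_idx=123, alpha=4.0)
print(steered.shape)                # torch.Size([1, 8, 768])
```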
Reposted by Dana Arad
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 23, 2025 at 6:15 PM
Reposted by Dana Arad
🚨🚨 New preprint 🚨🚨

Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?

We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.

arxiv.org/abs/2502.14829
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
February 21, 2025 at 12:43 PM
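To make the idea concrete, here is a minimal sketch under stated assumptions: "unlearn" one CoT step via gradient ascent on its language-modeling loss, then check how much the answer's log-probability drops. The model choice, hyperparameters, and ascent loop are illustrative, not the paper's exact recipe.

```python
# Hedged sketch of parametric faithfulness: erase one CoT step from the
# parameters, then measure the effect on the model's answer probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(prompt: str, answer: str) -> float:
    """Log-probability the model assigns to `answer` given `prompt`."""
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    pos = torch.arange(prompt_len - 1, ids.shape[1] - 1)
    return logprobs[pos, targets[pos]].sum().item()    # answer tokens only

def unlearn_step(step_text: str, lr: float = 5e-4, n_iters: int = 5) -> None:
    """Gradient *ascent* on one CoT step's LM loss, pushing the parameters
    away from the information that step encodes (illustrative)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ids = tok(step_text, return_tensors="pt").input_ids
    for _ in range(n_iters):
        loss = model(ids, labels=ids).loss
        (-loss).backward()   # maximize the loss on this step
        opt.step()
        opt.zero_grad()

before = answer_logprob("Q: What is 6 x 7? A:", " 42")
unlearn_step("6 x 7 = 42")
after = answer_logprob("Q: What is 6 x 7? A:", " 42")
print(f"Drop in answer log-prob after unlearning: {before - after:.3f}")
```

If the verbalized step was genuinely used, erasing it should noticeably lower the answer's probability; an unfaithful step should leave it largely unchanged.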
Reposted by Dana Arad
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find these hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social, and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Dana Arad
If you’re interested in mechanistic interpretability, I just found this starter pack and wanted to boost it (thanks for creating it @butanium.bsky.social !). Excited to have a mech interp community on bluesky 🎉

go.bsky.app/LisK3CP
November 19, 2024 at 12:28 AM
Reposted by Dana Arad
A starter pack for #NLP #NLProc researchers! 🎉

go.bsky.app/SngwGeS
November 4, 2024 at 10:01 AM