Dana Arad
@danaarad.bsky.social
NLP Researcher | CS PhD Candidate @ Technion
Pinned
Dana Arad
@danaarad.bsky.social
· May 27
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
Now accepted to EMNLP Main Conference!
August 20, 2025 at 7:38 PM
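For readers new to the topic, here is a minimal sketch of what "steering with SAE features" generally refers to: adding a scaled copy of one feature's decoder direction to the model's residual-stream activations. This is only an illustration, not the paper's feature-selection method; the decoder matrix, dimensions, feature index, and coefficient below are placeholder assumptions.

import torch

# Toy stand-ins; a real setup would load a trained SAE decoder for a specific layer.
hidden_dim, num_features = 768, 16384
W_dec = torch.randn(num_features, hidden_dim)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)   # unit-norm feature directions

def steer(resid: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    """Shift residual-stream activations along one SAE feature's decoder direction."""
    return resid + alpha * W_dec[feature_idx]

resid = torch.randn(1, 10, hidden_dim)             # (batch, seq_len, hidden) activations
steered = steer(resid, feature_idx=42, alpha=4.0)  # hypothetical feature index and strength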
Submit your work to #BlackboxNLP 2025!
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.
📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
August 12, 2025 at 7:13 PM
Excited to spend the rest of the summer visiting @davidbau.bsky.social's lab at Northeastern! If you’re in the area and want to chat about interpretability, let me know ☕️
August 10, 2025 at 1:56 PM
Reposted by Dana Arad
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!
Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
July 27, 2025 at 6:11 AM
10 days to go! Still time to run your method and submit!
Just 10 days to go until the results submission deadline for the MIB Shared Task at #BlackboxNLP!
If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!
Join us on Discord: discord.gg/n5uwjQcxPR
July 23, 2025 at 8:21 AM
Three weeks is plenty of time to submit your method!
⏳ Three weeks left! Submit your work to the MIB Shared Task at #BlackboxNLP, co-located with @emnlpmeeting.bsky.social
Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
July 13, 2025 at 6:11 AM
What are you working on for the MIB shared task?
Check out the full task description here: blackboxnlp.github.io/2025/task/
July 9, 2025 at 7:21 AM
Reposted by Dana Arad
New to mechanistic interpretability?
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation
Join the Discord server for ideas and discussions: discord.gg/n5uwjQcxPR
July 7, 2025 at 8:42 AM
VLMs perform better on questions about text than on the same questions about images - but why? And how can we fix it?
In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
June 26, 2025 at 10:41 AM
Reposted by Dana Arad
Working on circuit discovery in LMs?
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!
The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
June 24, 2025 at 2:24 PM
Reposted by Dana Arad
Have you heard about this year's shared task? 📢
Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
June 23, 2025 at 2:46 PM
Reposted by Dana Arad
SAEs have been found to massively underperform supervised methods for steering neural networks.
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 5:07 PM
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 4:06 PM
Reposted by Dana Arad
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 23, 2025 at 6:15 PM
Reposted by Dana Arad
🚨🚨 New preprint 🚨🚨
Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?
We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.
arxiv.org/abs/2502.14829
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
February 21, 2025 at 12:43 PM
Reposted by Dana Arad
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Dana Arad
If you’re interested in mechanistic interpretability, I just found this starter pack and wanted to boost it (thanks for creating it, @butanium.bsky.social!). Excited to have a mech interp community on Bluesky 🎉
go.bsky.app/LisK3CP
November 19, 2024 at 12:28 AM