#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.
📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
Hungry to discuss:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!
Join us on Discord: discord.gg/n5uwjQcxPR
Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
Check out the full task description here: blackboxnlp.github.io/2025/task/
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation
Join the Discord server for ideas and discussions: discord.gg/n5uwjQcxPR
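For intuition about what "standard evaluation" means here, below is a minimal sketch of a circuit-faithfulness check in the spirit of the benchmark. The model, the hand-picked "circuit" (a set of MLP layers), and mean-ablation are all illustrative placeholders, not the actual MIB metric or code:

```python
# Illustrative circuit-faithfulness check (NOT the MIB evaluation code):
# keep a hypothetical "circuit" intact, mean-ablate everything else, and
# see how much of the model's behavior survives.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

CIRCUIT_MLPS = {9, 10, 11}  # hypothetical: MLP layers a method claims matter
ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    full = model(ids).logits[0, -1]  # clean run, no ablation

def mean_ablate(module, inputs, output):
    # Replace the MLP output at every position with its mean over positions,
    # destroying position-specific computation outside the circuit.
    return output.mean(dim=1, keepdim=True).expand_as(output)

hooks = [blk.mlp.register_forward_hook(mean_ablate)
         for i, blk in enumerate(model.transformer.h) if i not in CIRCUIT_MLPS]
with torch.no_grad():
    ablated = model(ids).logits[0, -1]  # run with only the circuit intact
for h in hooks:
    h.remove()

paris = tok(" Paris", return_tensors="pt").input_ids[0, 0]
# A faithful circuit keeps the ablated logit close to the clean one.
print(f"clean logit {full[paris]:.2f} vs circuit-only logit {ablated[paris]:.2f}")
```

Real submissions localize much finer-grained units (heads, edges, features), but the clean-vs-ablated comparison above is the basic shape of the test.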
In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective and use our findings to close a third of it! 🧵
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!
The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
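To make the setting concrete, here is a minimal sketch of SAE steering itself: add a chosen feature's decoder direction to the residual stream during generation. The random decoder, layer, feature index, and scale below are placeholders; the preprint's contribution is how to *select* the feature, which this sketch deliberately leaves out:

```python
# Minimal SAE-steering sketch (illustrative; a real SAE decoder would be
# loaded from a trained checkpoint, and the feature would be chosen by the
# selection method from the preprint, not hard-coded).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

d_model, n_features = model.config.n_embd, 16384
W_dec = torch.randn(n_features, d_model)               # stand-in SAE decoder
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)

FEATURE, SCALE, LAYER = 1234, 8.0, 6                   # hypothetical choices
steer = SCALE * W_dec[FEATURE]

def add_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual stream.
    return (output[0] + steer,) + output[1:]

h = model.transformer.h[LAYER].register_forward_hook(add_direction)
out = model.generate(**tok("I think that", return_tensors="pt"),
                     max_new_tokens=20, do_sample=False)
h.remove()
print(tok.decode(out[0]))
```

Picking FEATURE badly is exactly what degrades generations; the selection criterion is what makes steering work.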
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?
We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.
arxiv.org/abs/2502.14829
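A minimal sketch of the idea, assuming (as one plausible instantiation) that "erasing" is done by a few gradient-ascent steps on the CoT step's tokens; the paper's actual unlearning procedure and faithfulness metrics are in the link above:

```python
# Parametric-faithfulness sketch (illustrative; the paper's unlearning
# method and metrics differ - see arxiv.org/abs/2502.14829).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

question = "Q: What is 48 divided by 6?\nA:"
cot_step = "48 divided by 6 is 8."    # the verbalized reasoning step to erase

def answer(m):
    ids = tok(question, return_tensors="pt").input_ids
    out = m.generate(ids, max_new_tokens=5, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:])

before = answer(model)

# "Erase" the step from the parameters: ascend the LM loss on its tokens
# so the model can no longer reproduce the information it carries.
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
step_ids = tok(cot_step, return_tensors="pt").input_ids
for _ in range(10):
    loss = -model(step_ids, labels=step_ids).loss    # gradient *ascent*
    opt.zero_grad()
    loss.backward()
    opt.step()

after = answer(model)
# If the step was faithful (actually used), erasing it should change the answer.
print(f"before: {before!r}  after: {after!r}")
```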
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We uncover these hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social, and Yonatan Belinkov
go.bsky.app/LisK3CP
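One way to see what "high certainty" means operationally, as a minimal illustration rather than the paper's methodology: compare the model's confidence in its greedy answer with whether the correct answer ever surfaces under sampling. The prompt and gold answer below are hypothetical examples:

```python
# Illustrative probe for high-certainty errors (NOT the paper's method):
# a confident wrong greedy answer, while the gold answer still appears
# under sampling, suggests the model "knows" yet hallucinates.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt, gold = "The author of Hamlet is", " William Shakespeare"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model.generate(ids, max_new_tokens=3, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)
greedy = tok.decode(out.sequences[0, ids.shape[1]:])
# Certainty: product of per-token probabilities of the greedy continuation.
probs = [torch.softmax(s, -1)[0, t].item()
         for s, t in zip(out.scores, out.sequences[0, ids.shape[1]:])]
certainty = torch.tensor(probs).prod().item()

# "Knowledge": does the gold answer show up among sampled continuations?
with torch.no_grad():
    samples = model.generate(ids, max_new_tokens=3, do_sample=True,
                             num_return_sequences=20)
knows = any(gold.strip() in tok.decode(s[ids.shape[1]:]) for s in samples)

print(f"greedy={greedy!r} certainty={certainty:.2f} knows_gold={knows}")
```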