Dana Arad
@danaarad.bsky.social
NLP Researcher | CS PhD Candidate @ Technion
Pinned
Dana Arad
@danaarad.bsky.social
· May 27
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
Now accepted to EMNLP Main Conference!
August 20, 2025 at 7:38 PM
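For readers new to the topic, here is a minimal sketch of what "steering with SAE features" generally refers to: adding a scaled copy of one feature's decoder direction to the model's residual-stream activations. This is only an illustration, not the paper's feature-selection method; the decoder matrix, dimensions, feature index, and coefficient below are placeholder assumptions.

import torch

# Toy stand-ins; a real setup would load a trained SAE decoder for a specific layer.
hidden_dim, num_features = 768, 16384
W_dec = torch.randn(num_features, hidden_dim)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)   # unit-norm feature directions

def steer(resid: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    """Shift residual-stream activations along one SAE feature's decoder direction."""
    return resid + alpha * W_dec[feature_idx]

resid = torch.randn(1, 10, hidden_dim)             # (batch, seq_len, hidden) activations
steered = steer(resid, feature_idx=42, alpha=4.0)  # hypothetical feature index and strength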
Submit your work to #BlackboxNLP 2025!
📢 Call for Papers! 📢
#BlackboxNLP 2025 invites the submission of archival and non-archival papers on interpreting and explaining NLP models.
📅 Deadlines: Aug 15 (direct submissions), Sept 5 (ARR commitment)
🔗 More details: blackboxnlp.github.io/2025/call/
August 12, 2025 at 7:13 PM
Excited to spend the rest of the summer visiting @davidbau.bsky.social's lab at Northeastern! If you’re in the area and want to chat about interpretability, let me know ☕️
August 10, 2025 at 1:56 PM
Reposted by Dana Arad
In Vienna for #ACL2025, and already had my first (vegan) Austrian sausage!
Now hungry for discussing:
– LLM behavior
– Interpretability
– Biases & Hallucinations
– Why eval is so hard (but so fun)
Come say hi if that’s your vibe too!
July 27, 2025 at 6:11 AM
10 days to go! Still time to run your method and submit!
Just 10 days to go until the results submission deadline for the MIB Shared Task at #BlackboxNLP!
If you're working on:
🧠 Circuit discovery
🔍 Feature attribution
🧪 Causal variable localization
now’s the time to polish and submit!
Join us on Discord: discord.gg/n5uwjQcxPR
July 23, 2025 at 8:21 AM
Three weeks is plenty of time to submit your method!
⏳ Three weeks left! Submit your work to the MIB Shared Task at #BlackboxNLP, co-located with @emnlpmeeting.bsky.social
Whether you're working on circuit discovery or causal variable localization, this is your chance to benchmark your method in a rigorous setup!
July 13, 2025 at 6:11 AM
What are you working on for the MIB shared task?
Check out the full task description here: blackboxnlp.github.io/2025/task/
July 9, 2025 at 7:21 AM
Reposted by Dana Arad
New to mechanistic interpretability?
The MIB shared task is a great opportunity to experiment:
✅ Clean setup
✅ Open baseline code
✅ Standard evaluation
Join the Discord server for ideas and discussions: discord.gg/n5uwjQcxPR
July 7, 2025 at 8:42 AM
VLMs perform better on questions about text than on the same questions about images - but why? And how can we fix it?
In a new project led by Yaniv (@YNikankin on the other app), we investigate this gap from a mechanistic perspective, and use our findings to close a third of it! 🧵
June 26, 2025 at 10:41 AM
Reposted by Dana Arad
Working on circuit discovery in LMs?
Consider submitting your work to the MIB Shared Task, part of #BlackboxNLP at @emnlpmeeting.bsky.social 2025!
The goal: benchmark existing MI methods and identify promising directions to precisely and concisely recover causal pathways in LMs >>
June 24, 2025 at 2:24 PM
Reposted by Dana Arad
Have you heard about this year's shared task? 📢
Mechanistic Interpretability (MI) is quickly advancing, but comparing methods remains a challenge. This year at #BlackboxNLP, we're introducing a shared task to rigorously evaluate MI methods in language models 🧵
June 23, 2025 at 2:46 PM
Reposted by Dana Arad
SAEs have been found to massively underperform supervised methods for steering neural networks.
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 5:07 PM
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
May 27, 2025 at 4:06 PM
Reposted by Dana Arad
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
April 23, 2025 at 6:15 PM
Reposted by Dana Arad
🚨🚨 New preprint 🚨🚨
Ever wonder whether verbalized CoTs correspond to the internal reasoning process of the model?
We propose a novel parametric faithfulness approach, which erases information contained in CoT steps from the model parameters to assess CoT faithfulness.
arxiv.org/abs/2502.14829
Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
February 21, 2025 at 12:43 PM
Reposted by Dana Arad
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Dana Arad
If you’re interested in mechanistic interpretability, I just found this starter pack and wanted to boost it (thanks for creating it, @butanium.bsky.social!). Excited to have a mech interp community on Bluesky 🎉
go.bsky.app/LisK3CP
November 19, 2024 at 12:28 AM