Tal Haklay
@talhaklay.bsky.social
NLP | Interpretability | PhD student at the Technion
Position-aware Automatic Circuit Discovery – Project Page
peap-circuits.github.io
May 22, 2025 at 8:11 AM
Website & CFP >> actionable-interpretability.github.io
General Information
July 19 - ICML 2025 - Vancouver
actionable-interpretability.github.io
May 14, 2025 at 1:04 PM
Reposted by Tal Haklay
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...
April 23, 2025 at 6:15 PM
Website >> actionable-interpretability.github.io
General Information
ICML 2025 - Vancouver
actionable-interpretability.github.io
April 7, 2025 at 1:51 PM
6. Position papers: Critical discussions on the feasibility, limitations, and future directions of actionable interpretability research. We also invite perspectives that question whether actionability should be a goal of interpretability research.
April 7, 2025 at 1:51 PM
5. Developing realistic benchmarking and assessment methods to measure the real-world impact of interpretability insights, particularly in production environments and large-scale models.
April 7, 2025 at 1:51 PM
4. Incorporating interpretability, which often focuses on micro-level decision analysis, into more complex scenarios, like reasoning processes or multi-turn interactions.
April 7, 2025 at 1:51 PM
3. New model architectures, training paradigms or design choices informed by interpretability findings.
April 7, 2025 at 1:51 PM
2. Comparative analyses of interpretability-based approaches versus alternative techniques like fine-tuning, prompting, and more.
April 7, 2025 at 1:51 PM
1. Practical applications of interpretability insights to address key challenges in AI such as hallucinations, biases, and adversarial robustness, as well as applications in high-stakes, less-explored domains like healthcare, finance, and cybersecurity.
April 7, 2025 at 1:51 PM
13/13 This work was done in collaboration with @hadasorgad.bsky.social, @davidbau.bsky.social, @amuuueller.bsky.social and @boknilev.bsky.social.
💡 Thoughts? Questions? Let’s discuss!
Website >> peap-circuits.github.io
Arxiv >> arxiv.org/abs/2502.04577
Position-aware Automatic Circuit Discovery
A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identi...
arxiv.org
March 6, 2025 at 10:15 PM
12/13 We evaluate our automatic pipeline across three datasets and two models, demonstrating that:
1️⃣ Our pipeline discovers circuits with a better tradeoff between size and faithfulness compared to EAP.
2️⃣ Our pipeline produces results comparable to those obtained when human experts define a schema.
March 6, 2025 at 10:15 PM
11/13 But where does this schema come from? And how do we determine the boundaries of each span within each example? Sounds like we just added more work for researchers! 😅
Actually, we show that an LLM (Claude) can do a pretty decent job at defining a schema and tagging all examples accordingly.
March 6, 2025 at 10:15 PM
10/13 After defining a schema, we construct an abstract computation graph where each span type corresponds to a single token position. We then map attribution scores from example-specific computation graphs to the abstract graph and identify circuits within it.
March 6, 2025 at 10:15 PM
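The mapping step in 10/13 can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the edge-score dictionary layout, the node names, and the choice to sum scores within a span and then average across examples are all assumptions made for the example.

```python
from collections import defaultdict


def position_to_span(spans):
    """Map each token index to its span name.

    spans: dict of span name -> (start, end) token indices, end exclusive.
    """
    lookup = {}
    for name, (start, end) in spans.items():
        for pos in range(start, end):
            lookup[pos] = name
    return lookup


def to_abstract_scores(examples):
    """Project per-example edge scores onto a span-level (abstract) graph.

    examples: list of (spans, edge_scores) pairs, where edge_scores maps
    (src_node, src_pos, dst_node, dst_pos) -> attribution score.
    Assumption: scores are summed within a span, then averaged over examples.
    """
    totals = defaultdict(float)
    for spans, edge_scores in examples:
        lookup = position_to_span(spans)
        for (src, src_pos, dst, dst_pos), score in edge_scores.items():
            key = (src, lookup[src_pos], dst, lookup[dst_pos])
            totals[key] += score
    return {key: total / len(examples) for key, total in totals.items()}


# Two toy examples whose "subject" spans differ in length.
example_short = (
    {"subject": (0, 1), "end": (1, 2)},
    {("a0.h1", 0, "mlp1", 1): 0.5},
)
example_long = (
    {"subject": (0, 2), "end": (2, 3)},
    {("a0.h1", 0, "mlp1", 2): 0.25, ("a0.h1", 1, "mlp1", 2): 0.25},
)
abstract = to_abstract_scores([example_short, example_long])
```

Both examples end up contributing to the same abstract edge, from `subject` to `end`, even though their token positions do not line up.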
9/13 To address this problem, we introduce the concept of a 𝙙𝙖𝙩𝙖𝙨𝙚𝙩 𝙨𝙘𝙝𝙚𝙢𝙖, which defines token spans with similar semantics across examples in the dataset.
March 6, 2025 at 10:15 PM
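As a rough picture of what a dataset schema might look like in code, here is a minimal sketch. The span names, the `Span` class, and the `tag_example` helper are hypothetical, and the requirement that spans tile the example contiguously is an assumption of this sketch rather than something stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str   # span type taken from the schema
    start: int  # first token index (inclusive)
    end: int    # last token index (exclusive)


# Hypothetical schema: the same span types apply to every example.
SCHEMA = ("subject", "verb", "object", "end")


def tag_example(tokens, boundaries):
    """Attach schema spans to one example.

    boundaries: dict of span name -> (start, end) token indices.
    Assumption: spans appear in schema order and cover every token.
    """
    spans = [Span(name, *boundaries[name]) for name in SCHEMA]
    cursor = 0
    for span in spans:
        if span.start != cursor or span.end <= span.start:
            raise ValueError(f"span {span.name!r} does not continue at token {cursor}")
        cursor = span.end
    if cursor != len(tokens):
        raise ValueError("spans do not cover every token")
    return spans


tokens_a = ["Mary", "gave", "the", "book", "to"]
spans_a = tag_example(
    tokens_a,
    {"subject": (0, 1), "verb": (1, 2), "object": (2, 4), "end": (4, 5)},
)

tokens_b = ["The", "tall", "man", "gave", "it", "to"]
spans_b = tag_example(
    tokens_b,
    {"subject": (0, 3), "verb": (3, 4), "object": (4, 5), "end": (5, 6)},
)
```

The two examples differ in length and structure, yet both are described by the same four span types, which is what makes them comparable.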
8/13 But you may notice an issue...
What if the examples in a dataset vary in length and structure?
Discovering a circuit in such cases is not straightforward, leading many researchers to focus only on datasets with uniform length and structure.
March 6, 2025 at 10:15 PM
7/13 First improvement:
We introduce 𝗣𝗼𝘀𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗘𝗱𝗴𝗲 𝗔𝘁𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻 𝗣𝗮𝘁𝗰𝗵𝗶𝗻𝗴 (𝗣𝗘𝗔𝗣), an extension of EAP that allows us to discover circuits that differentiate between token positions. The key advancement? Our approach uncovers "attention edges", revealing dependencies missed by previous methods.
March 6, 2025 at 10:15 PM
6/13 The Problem:
Automatic circuit discovery methods like Edge Attribution Patching (EAP) and EAP-IG implicitly assume that circuits are position-invariant: they do not differentiate between components at different token positions.
As a result, the circuit may include irrelevant components.
March 6, 2025 at 10:15 PM
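To make the problem in 6/13 concrete, here is a toy sketch contrasting the two selection strategies. It is a simplification under stated assumptions, not EAP or PEAP themselves: the edge names, the made-up scores, and the thresholding rule are all hypothetical, and real methods compute the scores from gradients and activations.

```python
def position_agnostic_circuit(scores, threshold):
    """Keep an edge at every position if its position-summed score passes.

    scores: dict of edge name -> list of per-position attribution scores.
    """
    kept = set()
    for edge, per_pos in scores.items():
        if abs(sum(per_pos)) >= threshold:
            kept.update((edge, pos) for pos in range(len(per_pos)))
    return kept


def positional_circuit(scores, threshold):
    """Keep only the (edge, position) pairs whose own score passes."""
    return {
        (edge, pos)
        for edge, per_pos in scores.items()
        for pos, score in enumerate(per_pos)
        if abs(score) >= threshold
    }


# Made-up scores: each edge matters at exactly one of three positions.
toy_scores = {
    "emb->head": [0.5, 0.0, 0.0],
    "head->mlp": [0.0, 0.0, 1.0],
}
agnostic = position_agnostic_circuit(toy_scores, threshold=0.5)
positional = positional_circuit(toy_scores, threshold=0.5)
```

In this toy case the position-agnostic selection keeps both edges at all three positions (six edge-position pairs), while the positional selection keeps only the two pairs that actually carry signal.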