@patqdasilva.bsky.social
Pinned
Super grateful to have received a senior area chair highlight at #ACL2025NLP
⏳ The generalization of interpretability-based steering methods is at an inflection point
🚂 As a community, we need to place stronger emphasis on evaluating the reliability of methods if we care about long-term impact
July 30, 2025 at 4:10 PM
🌟Excited to announce that “Steering off Course” was accepted to #ACL2025NLP for an Oral and Panel Discussion! arxiv.org/abs/2504.04635
📍Wed, 9AM, Level 2 Hall A
🍁I will also share this work at Actionable Interpretability @ActInterp at #ICML2025
📍Sat, 1PM, East Ballroom A
July 16, 2025 at 5:03 PM
Steering language models by directly intervening on internal activations is appealing–but does it generalize?
We study 3 popular steering methods with 36 models from 14 families (1.5-70B), exposing brittle performance and fundamental flaws in underlying assumptions
🧵👇
(1/10)
April 8, 2025 at 11:34 AM