⏳ The generalization of interpretability-based steering methods is at an inflection point
🚂 As a community, we need to place stronger emphasis on evaluating the reliability of methods if we care about long-term impact
📍Wed, 9AM, Level 2 Hall A
🍁I will also share this work at Actionable Interpretability @ActInterp at #ICML2025
📍Sat, 1PM, East Ballroom A
We study 3 popular steering methods across 36 models from 14 families (1.5B-70B parameters), exposing brittle performance and fundamental flaws in their underlying assumptions
🧵👇
(1/10)