Joschka Braun
@joschkabraun.bsky.social
ML Master's student at the University of Tübingen | Researching Deep Learning, LLMs, & AI Safety at KASL & Health NLP Lab | https://joschkacbraun.github.io/
7/ 🙌 Huge thanks to my coauthors @CarstenEickhoff and @ABH878 from the Health NLP Lab Tübingen for their amazing mentorship!

🔗 Read the full paper: openreview.net/forum?id=sbm...
💻 Code & steering-vector datasets: github.com/JoschkaCBrau...
July 13, 2025 at 2:36 PM
6/ 💡 Best practice:
A hybrid approach that combines steering vectors with prompt engineering achieves the best balance between effective control and high-quality summaries at moderate steering strengths.
5/⚠️ Efficacy-Quality trade-off:
High steering strengths (|λ| > 2) increase control over targeted properties but significantly degrade fluency, diversity & faithfulness of generated summaries, aligning with prior findings that stronger steering degrades model performance.
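The steering strength λ mentioned above simply scales the vector added to the model's activations. A minimal numpy sketch (toy activations and a hypothetical steering vector stand in for the transformer residual stream; this is not the paper's code):

```python
import numpy as np

def steer(activations, steering_vector, lam):
    """Add the steering vector, scaled by strength lam, to every position."""
    return activations + lam * steering_vector

# Toy residual-stream activations: (seq_len=3, hidden_dim=4)
acts = np.ones((3, 4))
v = np.array([0.5, -0.5, 0.0, 1.0])

# |lam| > 2 is the regime where summary quality degrades in the paper
steered = steer(acts, v, lam=2.0)
print(steered[0])  # [2. 0. 1. 3.]
```

Negative λ steers away from the property; the trade-off above is about how large |λ| can get before fluency and faithfulness suffer.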
4/ ✅ Main findings:
• Steering vectors effectively control topical focus, sentiment & readability (stronger λ → larger effects)
• Steering alone can’t induce toxicity in safety-tuned Llama 3; only a combination of steering and prompting yields toxic summaries
3/ 🔬 Method:
We apply steering vectors (Panickssery et al. '24) during summarization on NEWTS (Bahrainian et al. '22).
We evaluate how steering affects:
• Target properties
• Intrinsic quality (fluency & diversity)
• Extrinsic quality (faithfulness to the human reference summary)
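The steering vectors used in the method above are extracted contrastively. A minimal numpy sketch of the difference-of-means recipe, with toy 2-D "activations" standing in for residual-stream activations of contrastive prompt pairs (illustrative only, not the paper's implementation):

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Difference-of-means steering vector: mean activation on prompts that
    exhibit the target property minus mean activation on prompts that don't."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Toy activations, shape (n_samples, hidden_dim)
pos = np.array([[1.0, 2.0], [3.0, 4.0]])
neg = np.array([[0.0, 1.0], [2.0, 1.0]])

v = caa_vector(pos, neg)  # added to activations (scaled by lambda) at inference
print(v)  # [1. 2.]
```

During summarization, this vector is added to a chosen layer's activations at every generation step, which is what makes the targeted properties shift.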
2/ 🔍 Key question:
Can we adaptively control topical focus, sentiment, readability, and toxicity without degrading summary quality?
Prior work has mostly evaluated steering vectors in multiple-choice settings, reporting unreliable effect sizes (Tan et al. '24).
7/7 Huge thanks to my amazing coauthors: Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian and Dmitrii Krasheninnikov!  Full paper here: openreview.net/forum?id=qGC...
Understanding (Un)Reliability of Steering Vectors in Language Models
Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance...
May 23, 2025 at 10:04 AM
6/ Can the choice of training prompt type improve steering reliability? Average net-positive effects were similar across all 7 prompt types, but effect-size variance remained high, often causing net-negative outcomes. So we did not achieve prompting-based reliability improvements.
5/ b) Separability of positive/negative training activations along the steering direction also predicts steering success. This makes sense: when representations of the target behavior and its opposite are better differentiated, a measurable change in behavior after steering is more likely.
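The separability predictor above can be sketched as a projection onto the steering direction followed by a gap-over-spread score. A minimal numpy illustration with toy activations (the function name and the pooled-std score are my choices, not the paper's exact metric):

```python
import numpy as np

def separation_along_direction(pos, neg):
    """Project positive/negative training activations onto the (unit) steering
    direction and report the gap between class means in pooled-std units."""
    v = pos.mean(axis=0) - neg.mean(axis=0)   # difference-of-means direction
    v = v / np.linalg.norm(v)
    p, q = pos @ v, neg @ v                   # 1-D projections
    pooled_std = np.sqrt((p.var() + q.var()) / 2)
    return (p.mean() - q.mean()) / pooled_std

# Toy, well-separated classes along the first axis
pos = np.array([[2.0, 0.0], [3.0, 0.0]])
neg = np.array([[-2.0, 0.0], [-3.0, 0.0]])
print(separation_along_direction(pos, neg))  # 10.0
```

A large score means the two behaviors occupy distinct regions along the steering direction, which is the regime where steering is predicted to work.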
4/ a) Directional agreement (how well the training activation differences align) predicts steering success. Intuitively, if the differences across training samples consistently point in the same direction, then steering with a single fixed vector is a decent approximation for all of them.
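Directional agreement as described above is just the average pairwise cosine similarity of the per-sample activation differences. A minimal numpy sketch on toy difference vectors (illustrative, not the paper's code):

```python
import numpy as np

def directional_agreement(diffs):
    """Mean pairwise cosine similarity of per-sample activation differences.
    Values near 1 mean one fixed steering vector fits all training pairs."""
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(diffs)
    return (sims.sum() - n) / (n * (n - 1))   # average over off-diagonal pairs

aligned = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
mixed = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
print(directional_agreement(aligned))  # 1.0
print(directional_agreement(mixed))    # ~0.333
```

When agreement is low, the mean-difference vector averages over disagreeing directions, which is one intuition for why steering then becomes unreliable.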
3/ Background: We study CAA (difference-of-mean-activations) steering vectors (Panickssery et al. '24). Prior work showed their reliability varies significantly (Tan et al. '24). Our findings validate intuitive geometric predictors of steering success with experiments & visualizations!
2/ Higher steering reliability is predictable from intuitive properties of steering vectors’ training activation geometry: a) higher directional agreement (cosine similarity) of activation differences in training data; b) better separation between positive & negative activations.