Joschka Braun
joschkabraun.bsky.social
ML Master's student at University Tübingen | Researching Deep Learning, LLMs, & AI Safety at KASL & Health NLP Lab | https://joschkacbraun.github.io/
6/💡 Best practice:
A hybrid approach that combines steering vectors with prompt engineering achieves the best balance between effective control and high-quality summaries at moderate steering strengths.
July 13, 2025 at 2:36 PM
5/⚠️ Efficacy-Quality trade-off:
High steering strengths (|λ| > 2) increase control over targeted properties but significantly degrade fluency, diversity & faithfulness of generated summaries, aligning with prior findings that stronger steering degrades model performance.
July 13, 2025 at 2:36 PM
4/ ✅ Main findings:
• Steering vectors effectively control topical focus, sentiment & readability (stronger λ → larger effects)
• Steering alone can’t induce toxicity in safety-tuned Llama 3; only a combination of steering and prompting yields toxic summaries
July 13, 2025 at 2:36 PM
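To make the role of λ in the findings above concrete, here is a minimal sketch of how a steering vector is commonly applied: the vector is scaled by the strength λ and added to the model's hidden activations. The function name `apply_steering`, the array shapes, and the choice to add the vector at every token position are illustrative assumptions, not details taken from the thread.

```python
import numpy as np

def apply_steering(hidden_states, steering_vector, strength):
    """Shift every token's activation by strength * steering_vector.

    hidden_states:   (num_tokens, hidden_dim) activations (toy data here)
    steering_vector: (hidden_dim,) direction associated with the target property
    strength:        scalar lambda; its sign picks the direction of the shift
    """
    return hidden_states + strength * steering_vector

# Toy data standing in for real model activations (assumption).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
vector = rng.normal(size=8)

for lam in (0.5, 2.0, 4.0):
    shift = np.linalg.norm(apply_steering(hidden, vector, lam) - hidden, axis=1)
    # The size of the shift grows linearly with |lambda|.
    print(f"lambda={lam}: mean shift norm = {shift.mean():.3f}")
```

In this sketch the displacement of each activation is exactly |λ| times the norm of the steering vector, which is one way to read "stronger λ → larger effects".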
6/ Can training prompt types improve steering reliability? While average net-positive effects were similar across all 7 prompt types, the variance of their effect sizes remained high, often causing net-negative outcomes. So changing the prompt type did not improve steering reliability.
May 23, 2025 at 10:04 AM
5/ b) Separability of positive / negative training activations along steering direction also predicts steering success. This makes sense: better-differentiated representations of the target behavior and its opposite => measurable change in behavior is more likely after steering.
May 23, 2025 at 10:04 AM
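The separability idea in 5/ can be sketched as follows: project the positive and negative training activations onto the steering direction and measure how far apart the two sets of projections are relative to their spread. The specific score used here (difference of projected means divided by the pooled standard deviation) and the name `separability_along_direction` are assumptions for illustration; the thread does not state which measure was used.

```python
import numpy as np

def separability_along_direction(pos_acts, neg_acts):
    """Separation of two activation sets along the mean-difference direction.

    pos_acts, neg_acts: (num_samples, hidden_dim) arrays.
    Returns the gap between projected means in units of pooled std.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    pos_proj = pos_acts @ direction
    neg_proj = neg_acts @ direction
    pooled_std = np.sqrt(0.5 * (pos_proj.var(ddof=1) + neg_proj.var(ddof=1)))
    return float((pos_proj.mean() - neg_proj.mean()) / pooled_std)

# Toy data (assumption): one well-separated pair and one overlapping pair.
rng = np.random.default_rng(0)
neg = rng.normal(size=(200, 8))
pos_far = rng.normal(size=(200, 8)) + 3.0    # large offset from neg
pos_near = rng.normal(size=(200, 8)) + 0.05  # almost no offset from neg

sep_far = separability_along_direction(pos_far, neg)
sep_near = separability_along_direction(pos_near, neg)
print(f"well separated: {sep_far:.2f}, overlapping: {sep_near:.2f}")
```

Under this sketch, a high score means the behavior and its opposite occupy clearly different regions along the steering direction, which matches the intuition given in the post.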
4/ a) Directional Agreement – how well the training activation differences align – predicts steering success. Intuitively, when the activation differences of the training samples consistently point in the same direction, steering with a fixed vector is a decent approximation.
May 23, 2025 at 10:04 AM
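One way to quantify the directional agreement from 4/ is the mean cosine similarity between each training pair's activation difference and the steering vector, taken here as the mean of those differences. Both that choice of steering vector and the function name `directional_agreement` are assumptions made for this sketch rather than definitions quoted from the thread.

```python
import numpy as np

def directional_agreement(pos_acts, neg_acts):
    """Mean cosine similarity between per-pair differences and their mean.

    pos_acts, neg_acts: (num_pairs, hidden_dim) arrays of paired activations.
    """
    diffs = pos_acts - neg_acts
    steering_vector = diffs.mean(axis=0)
    unit_vector = steering_vector / np.linalg.norm(steering_vector)
    cosines = (diffs @ unit_vector) / np.linalg.norm(diffs, axis=1)
    return float(cosines.mean())

# Toy data (assumption): pairs sharing one direction vs. unrelated pairs.
rng = np.random.default_rng(0)
neg = rng.normal(size=(100, 16))
shared_shift = rng.normal(size=16)
pos_consistent = neg + shared_shift + 0.1 * rng.normal(size=(100, 16))
pos_noisy = neg + rng.normal(size=(100, 16))

agree_consistent = directional_agreement(pos_consistent, neg)
agree_noisy = directional_agreement(pos_noisy, neg)
print(f"consistent: {agree_consistent:.2f}, noisy: {agree_noisy:.2f}")
```

When the differences share a direction the score approaches 1, and a single fixed vector summarizes them well; when they scatter, the score drops toward 0 and the fixed vector is a poor approximation for most samples.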