Lightnews — Scholar-powered news

Joschka Braun

@joschkabraun.bsky.social

ML Master's student at University Tübingen | Researching Deep Learning, LLMs, & AI Safety at KASL & Health NLP Lab | https://joschkacbraun.github.io/


Joschka Braun

@joschkabraun.bsky.social

6/💡 Best practice:
A hybrid approach—combining steering vectors with prompt engineering—achieves the best balance between effective control and high quality summaries at moderate steering strengths.

July 13, 2025 at 2:36 PM

Joschka Braun

@joschkabraun.bsky.social

5/⚠️ Efficacy-Quality trade-off:
High steering strengths (|λ| > 2) increase control over targeted properties but significantly degrade fluency, diversity & faithfulness of generated summaries, aligning with prior findings that stronger steering degrades model performance.

July 13, 2025 at 2:36 PM

Joschka Braun

@joschkabraun.bsky.social

4/ ✅ Main findings:
• Steering vectors effectively control topical focus, sentiment & readability (stronger λ → larger effects)
• Steering alone can’t induce toxicity in safety-tuned Llama 3; only a combination of steering and prompting yields toxic summaries

July 13, 2025 at 2:36 PM
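To make the role of the steering strength λ in the post above concrete, here is a minimal sketch assuming steering is done by activation addition, i.e. adding λ times a fixed vector to a layer's hidden states. The function name `steer`, the array shapes, and the single-layer setup are illustrative assumptions, not details taken from the thread.

```python
import numpy as np

def steer(hidden, v, lam):
    """Shift hidden states along a steering vector (hypothetical helper).

    hidden: (tokens, d) array of activations at one layer (assumed shape).
    v:      (d,) steering vector.
    lam:    steering strength; the sign picks the direction, |lam| the size.
    """
    # Broadcasting adds the same scaled vector to every token position.
    return hidden + lam * v

hidden = np.zeros((2, 3))
v = np.array([1.0, 0.0, 0.0])
steered = steer(hidden, v, lam=2.0)
```

In this sketch, a larger |λ| moves every token's activation further along `v`, which is one way to read "stronger λ → larger effects".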

Joschka Braun

@joschkabraun.bsky.social

6/ Can training prompt types improve steering reliability? While average net-positive effects were similar across all 7 prompt types, their effect size variance remained high, often causing net-negative outcomes. So, we did not achieve prompting-based reliability improvements.

May 23, 2025 at 10:04 AM

Joschka Braun

@joschkabraun.bsky.social

5/ b) Separability of positive / negative training activations along steering direction also predicts steering success. This makes sense: better-differentiated representations of the target behavior and its opposite => measurable change in behavior is more likely after steering.

May 23, 2025 at 10:04 AM
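One way to quantify the separability described in the post above is to project positive and negative training activations onto the steering direction and compare the two projected distributions. The sketch below uses a d-prime style score (difference of means over pooled standard deviation). That score, the function name `separability`, and the mean-difference steering vector are assumptions for illustration; the thread does not say which measure was used.

```python
import numpy as np

def separability(pos, neg):
    """d-prime style separation along the steering direction (assumed metric).

    pos, neg: (n, d) arrays of paired positive / negative activations.
    """
    # Assumption: the steering vector is the mean activation difference.
    v = (pos - neg).mean(axis=0)
    u = v / np.linalg.norm(v)

    # Scalar projections of each sample onto the unit steering direction.
    p, n = pos @ u, neg @ u

    pooled_std = np.sqrt(0.5 * (p.var() + n.var()))
    if pooled_std == 0.0:
        return float("inf")  # projections do not vary within either class
    return float((p.mean() - n.mean()) / pooled_std)

pos = np.array([[2.0, 0.0], [4.0, 0.0]])
neg = np.array([[0.0, 0.0], [-2.0, 0.0]])
score = separability(pos, neg)
```

Higher scores mean the two classes overlap less along the steering direction.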

Joschka Braun

@joschkabraun.bsky.social

4/ a) Directional Agreement (how well the training activation differences align) predicts steering success. Intuitively, if the activation differences of the training samples consistently point in the same direction, steering with a fixed vector is a decent approximation.

May 23, 2025 at 10:04 AM
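The directional agreement idea in the post above can be sketched as the mean cosine similarity between each training pair's activation difference and the mean difference (the steering vector). This definition, the function name `directional_agreement`, and the toy inputs are assumptions for illustration and may differ from the exact metric used in the work.

```python
import numpy as np

def directional_agreement(pos, neg):
    """Mean cosine similarity between per-pair differences and their mean.

    pos, neg: (n, d) arrays of paired positive / negative activations.
    Assumes no individual difference and no mean difference is the zero vector.
    """
    diffs = pos - neg              # one difference vector per training pair
    v = diffs.mean(axis=0)         # assumed steering vector: mean difference
    cos = (diffs @ v) / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(v))
    return float(cos.mean())

# All differences point the same way, so agreement is maximal.
aligned = directional_agreement(
    np.array([[1.0, 0.0], [2.0, 0.0]]), np.zeros((2, 2))
)

# Orthogonal differences agree with their mean only partially.
orthogonal = directional_agreement(
    np.array([[1.0, 0.0], [0.0, 1.0]]), np.zeros((2, 2))
)
```

Values near 1 indicate that a single fixed vector approximates the per-sample differences well. Lower values indicate that the differences point in varied directions.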
