Maarten Buyl
@maartenbuyl.bsky.social
8/n We argue that, without a deeper understanding of alignment discretion, today’s AI alignment process risks resembling a kangaroo court—where annotators wield unchecked power to shape AI behavior without transparency or oversight.
February 19, 2025 at 9:08 PM
7/n Though GPT-4o, DeepSeek-V3, and Sonnet 3.5 exercise discretion in mostly similar ways (see above), they exhibit subtle differences. Here are some examples:
February 19, 2025 at 9:08 PM
6/n We can go further: by analyzing how often each principle ‘wins’ or ‘loses,’ we compute an Elo rating for each, giving a sense of that principle’s ‘priority’. This reveals huge discrepancies between human annotators, trained reward models, and LLMs.
February 19, 2025 at 9:08 PM
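As a rough illustration of the Elo idea in 6/n above, here is a minimal sketch that rates principles from pairwise conflict outcomes. The principle names, K-factor, starting rating, and match list are illustrative assumptions, not the paper's actual setup or data.

```python
# Minimal Elo sketch: rate alignment principles by how often they "win"
# when they conflict. Names, K, BASE, and the match list are assumptions.

K = 32          # Elo update step (assumed value)
BASE = 1000.0   # starting rating for every principle (assumed value)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that principle A beats principle B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one Elo update for a single annotated conflict case."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# Each match: (winning principle, losing principle) from one conflict case.
matches = [("harmlessness", "helpfulness"),
           ("harmlessness", "honesty"),
           ("helpfulness", "honesty")]

ratings = {"helpfulness": BASE, "harmlessness": BASE, "honesty": BASE}
for winner, loser in matches:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest 'priority' first
```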
5/n In conflict cases, annotators must decide which principle takes priority. By tracking how often different principles ‘win,’ we can measure which values annotators prioritize most. For example, in HH-RLHF, we uncover distinct patterns in annotator decision-making:
February 19, 2025 at 9:08 PM
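A minimal sketch of the win-counting idea in 5/n above, assuming a hypothetical record format in which each conflict lists the principles that sided with the chosen response and those that sided with the rejected one; the field and principle names are made up for illustration.

```python
from collections import Counter

# Hypothetical conflict records: principles favoring the annotator's chosen
# response "win", those favoring the rejected response "lose".
conflicts = [
    {"winners": ["harmlessness"], "losers": ["helpfulness"]},
    {"winners": ["harmlessness"], "losers": ["helpfulness", "honesty"]},
    {"winners": ["helpfulness"], "losers": ["honesty"]},
]

wins, losses = Counter(), Counter()
for case in conflicts:
    wins.update(case["winners"])
    losses.update(case["losers"])

# Win rate per principle: a rough signal of the annotator's priorities.
for principle in sorted(set(wins) | set(losses)):
    total = wins[principle] + losses[principle]
    print(f"{principle}: {wins[principle] / total:.2f} win rate over {total} conflicts")
```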
4/n In consensus cases, we’d expect annotators to agree. In reality, we find widespread arbitrariness—annotators frequently override clear consensus in alignment datasets like HH-RLHF and PKU-SafeRLHF. And when algorithms annotate? The pattern varies significantly.
February 19, 2025 at 9:08 PM
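One way the consensus-overriding rate in 4/n above could be measured, assuming each consensus case records every applicable principle's preferred response alongside the annotator's actual choice; the record format here is hypothetical, not the paper's.

```python
# Hypothetical records: which response ("A" or "B") each applicable principle
# prefers, and which response the annotator actually chose.
consensus_cases = [
    {"principle_choices": ["A", "A", "A"], "annotator_choice": "A"},
    {"principle_choices": ["B", "B"], "annotator_choice": "A"},  # consensus overridden
    {"principle_choices": ["A", "A"], "annotator_choice": "A"},
]

overrides = sum(
    1 for case in consensus_cases
    if len(set(case["principle_choices"])) == 1                 # genuine consensus
    and case["annotator_choice"] != case["principle_choices"][0]
)
print(f"consensus overridden in {overrides}/{len(consensus_cases)} cases")
```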
3/n We formalize that, for a given case, a set of alignment principles can agree or disagree in one of three ways (see the sketch after this post):
1️⃣ Consensus (all relevant principles agree on the best output)
2️⃣ Conflicts (some principles disagree on the best output)
3️⃣ Indifference (no principle applies)
February 19, 2025 at 9:08 PM
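A minimal sketch of the three-way taxonomy in 3/n above, assuming each relevant principle's verdict on a pair of outputs is encoded as "A" or "B" and principles that do not apply are simply omitted; this encoding is an illustrative assumption.

```python
def classify_case(principle_prefs: list) -> str:
    """Classify one preference pair from the verdicts of the relevant principles.

    principle_prefs holds, per principle that applies to this pair, which
    output it favors ("A" or "B"); non-applicable principles are omitted.
    """
    if not principle_prefs:
        return "indifference"  # no principle applies
    if len(set(principle_prefs)) == 1:
        return "consensus"     # all relevant principles agree
    return "conflict"          # at least two principles disagree

print(classify_case([]))          # indifference
print(classify_case(["A", "A"]))  # consensus
print(classify_case(["A", "B"]))  # conflict
```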
2/n AI alignment typically relies on preference-based learning: annotators rank responses, teaching models what counts as "helpful" and "harmless." Yet, just as judges exercise discretion in applying the law (judicial discretion), annotators exercise discretion in balancing conflicting principles (alignment discretion).
February 19, 2025 at 9:08 PM
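For background on the preference-based learning mentioned in 2/n above: reward models are commonly trained with a Bradley-Terry-style loss that pushes the reward of the annotator-chosen response above the rejected one. This is a generic sketch of that standard objective, not the paper's code; the toy reward values are made up.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: maximize P(chosen preferred over rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards for a batch of (chosen, rejected) response pairs (made-up numbers).
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(preference_loss(r_chosen, r_rejected))
```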