Maarten Buyl
@maartenbuyl.bsky.social
8/n We argue that, without a deeper understanding of alignment discretion, today’s AI alignment process risks resembling a kangaroo court—where annotators wield unchecked power to shape AI behavior without transparency or oversight.
February 19, 2025 at 9:08 PM
7/n Though GPT-4o, DeepSeek-V3, and Sonnet 3.5 exercise discretion in mostly similar ways (see above), they exhibit subtle differences. Here are some examples:
February 19, 2025 at 9:08 PM
6/n We can go further: by analyzing how often each principle ‘wins’ or ‘loses,’ we compute an Elo rating for each, giving a sense of that principle’s ‘priority’. This reveals huge discrepancies between human annotators, trained reward models, and LLMs.
February 19, 2025 at 9:08 PM
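As a rough illustration of the Elo idea in 6/n above, here is a minimal sketch that rates principles from pairwise conflict outcomes. The principle names, K-factor, starting rating, and match list are illustrative assumptions, not the paper's actual setup or data.

```python
# Minimal Elo sketch: rate alignment principles by how often they "win"
# when they conflict. Names, K, BASE, and the match list are assumptions.

K = 32          # Elo update step (assumed value)
BASE = 1000.0   # starting rating for every principle (assumed value)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that principle A beats principle B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one Elo update for a single annotated conflict case."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_w)
    ratings[loser] -= K * (1.0 - e_w)

# Each match: (winning principle, losing principle) from one conflict case.
matches = [("harmlessness", "helpfulness"),
           ("harmlessness", "honesty"),
           ("helpfulness", "honesty")]

ratings = {"helpfulness": BASE, "harmlessness": BASE, "honesty": BASE}
for winner, loser in matches:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest 'priority' first
```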
5/n In conflict cases, annotators must decide which principle takes priority. By tracking how often different principles ‘win,’ we can measure which values annotators prioritize most. For example, in HH-RLHF, we uncover distinct patterns in annotator decision-making:
February 19, 2025 at 9:08 PM
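A minimal sketch of the win-counting idea in 5/n above, assuming a hypothetical record format in which each conflict lists the principles that sided with the chosen response and those that sided with the rejected one; the field and principle names are made up for illustration.

```python
from collections import Counter

# Hypothetical conflict records: principles favoring the annotator's chosen
# response "win", those favoring the rejected response "lose".
conflicts = [
    {"winners": ["harmlessness"], "losers": ["helpfulness"]},
    {"winners": ["harmlessness"], "losers": ["helpfulness", "honesty"]},
    {"winners": ["helpfulness"], "losers": ["honesty"]},
]

wins, losses = Counter(), Counter()
for case in conflicts:
    wins.update(case["winners"])
    losses.update(case["losers"])

# Win rate per principle: a rough signal of the annotator's priorities.
for principle in sorted(set(wins) | set(losses)):
    total = wins[principle] + losses[principle]
    print(f"{principle}: {wins[principle] / total:.2f} win rate over {total} conflicts")
```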
4/n In consensus cases, we’d expect annotators to agree. In reality, we find widespread arbitrariness—annotators frequently override clear consensus in alignment datasets like HH-RLHF and PKU-SafeRLHF. And when algorithms annotate? The pattern varies significantly.
February 19, 2025 at 9:08 PM
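One way the consensus-overriding rate in 4/n above could be measured, assuming each consensus case records every applicable principle's preferred response alongside the annotator's actual choice; the record format here is hypothetical, not the paper's.

```python
# Hypothetical records: which response ("A" or "B") each applicable principle
# prefers, and which response the annotator actually chose.
consensus_cases = [
    {"principle_choices": ["A", "A", "A"], "annotator_choice": "A"},
    {"principle_choices": ["B", "B"], "annotator_choice": "A"},  # consensus overridden
    {"principle_choices": ["A", "A"], "annotator_choice": "A"},
]

overrides = sum(
    1 for case in consensus_cases
    if len(set(case["principle_choices"])) == 1                 # genuine consensus
    and case["annotator_choice"] != case["principle_choices"][0]
)
print(f"consensus overridden in {overrides}/{len(consensus_cases)} cases")
```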
3/n We formalize that, for a given case, a set of alignment principles can agree or disagree in one of three ways (see the sketch after this post):
1️⃣ Consensus (all relevant principles agree on the best output)
2️⃣ Conflicts (some principles disagree on the best output)
3️⃣ Indifference (no principle applies)
February 19, 2025 at 9:08 PM
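A minimal sketch of the three-way taxonomy in 3/n above, assuming each relevant principle's verdict on a pair of outputs is encoded as "A" or "B" and principles that do not apply are simply omitted; this encoding is an illustrative assumption.

```python
def classify_case(principle_prefs: list) -> str:
    """Classify one preference pair from the verdicts of the relevant principles.

    principle_prefs holds, per principle that applies to this pair, which
    output it favors ("A" or "B"); non-applicable principles are omitted.
    """
    if not principle_prefs:
        return "indifference"  # no principle applies
    if len(set(principle_prefs)) == 1:
        return "consensus"     # all relevant principles agree
    return "conflict"          # at least two principles disagree

print(classify_case([]))          # indifference
print(classify_case(["A", "A"]))  # consensus
print(classify_case(["A", "B"]))  # conflict
```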
2/n AI alignment typically relies on preference-based learning: annotators rank responses, teaching models what counts as "helpful" and "harmless." Yet, just as judges exercise discretion in applying the law (judicial discretion), annotators exercise discretion in balancing conflicting principles (alignment discretion).
February 19, 2025 at 9:08 PM
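For background on the preference-based learning mentioned in 2/n above: reward models are commonly trained with a Bradley-Terry-style loss that pushes the reward of the annotator-chosen response above the rejected one. This is a generic sketch of that standard objective, not the paper's code; the toy reward values are made up.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: maximize P(chosen preferred over rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards for a batch of (chosen, rejected) response pairs (made-up numbers).
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(preference_loss(r_chosen, r_rejected))
```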