Benjamin Hilton
@benjamin-hilton.bsky.social
Alignment @AISI. Semi-informed about economics, physics and governments. Views my own.
Humans are often very wrong.

This is a big problem if you want to use human judgment to oversee super-smart AI systems.

In our new post, @girving.bsky.social argues that we might be able to deal with this issue – not by fixing the humans, but by redesigning oversight protocols.
May 14, 2025 at 3:39 PM
Want to build an aligned ASI? Our new paper explains how to do that, using debate.

Tl;dr:

Debate + exploration guarantees + no obfuscated arguments + good human input = outer alignment

Outer alignment + online training = inner alignment*

* sufficient for low-stakes contexts
May 8, 2025 at 5:23 PM
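(For readers new to the setup, a rough sketch of what a debate-based oversight loop can look like in code is below. All names here — `debaters`, `judge`, `num_turns` — are illustrative placeholders, not the protocol from the paper.)

```python
def debate_round(question, debaters, judge, num_turns=4):
    """Run one debate and return the human judge's verdict plus the transcript.

    Hypothetical sketch: `debaters` is a dict {"A": fn, "B": fn}, where each fn
    maps (question, transcript) -> an argument string; `judge` maps the full
    transcript -> "A" or "B". A real protocol would add exploration guarantees
    and checks against obfuscated arguments.
    """
    transcript = []
    for _ in range(num_turns):
        for side in ("A", "B"):
            argument = debaters[side](question, transcript)
            transcript.append((side, argument))
    winner = judge(transcript)
    # Online training would reward the winning side's policy here,
    # which is what the post's inner-alignment claim relies on.
    return winner, transcript
```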
The Alignment Team at UK AISI now has a research agenda.

Our goal: solve the alignment problem.
How: develop concrete, parallelisable open problems.

Our initial focus is on asymptotic honesty guarantees (more details in the post).

1/5
May 7, 2025 at 5:56 PM
Interested in getting UK AISI support to do alignment research?

Fill in our short, < 5-min form, and we'll get back to you on proposals within 1 week.

(Caveat: While we may reach out about AISI funding your project, filling out this form is not an application for funding.)
April 16, 2025 at 4:39 PM