Benjamin Hilton
@benjamin-hilton.bsky.social
Alignment @AISI. Semi-informed about economics, physics and governments. Views my own.
As always, we'd be very excited to collaborate on further research. If you're interested in collaborating with UK AISI, you can express interest at forms.office.com/e/BFbeUeWYQ9. If you're a non-profit or academic, you can also apply for grants up to £200,000 directly at aisi.gov.uk/grants.
May 14, 2025 at 3:39 PM
Huge thanks to Marie Buhl, Jacob Pfau and @girving.bsky.social for all their work on this. Excited to get stuck into future work!
May 8, 2025 at 5:23 PM
There are still loads of open problems.

We need to get each part of the above right – exploration guarantees and human input particularly stand out to me (optimistic about obfuscated arguments, stand by for future publications...)
May 8, 2025 at 5:23 PM
Two things that stand out for me from this paper:
– Debate gets you correctness/honesty. That's not sufficient for harmlessness, but is a great first step.
– Low-stakes alignment (where single errors are tolerable, but errors on average are not) seems (imo) totally doable
May 8, 2025 at 5:23 PM
This is just the start. We'll be following this up shortly with:
– A safety case sketch for debate, giving a whole host more detail on the open problems.
– A series of posts (something like 1 a week) diving into various problems we'd like to see solved.

5/5
May 7, 2025 at 5:56 PM
We've included a long list of open problems we'd like people to solve – and a reminder that you can express interest in collaborating, and apply to our challenge fund for grant funding!

bsky.app/profile/benj...

4/5
Interested in getting UK AISI support to do alignment research?

Fill in our short, <5-minute form, and we'll get back to you on proposals within 1 week.

(Caveat: While we may reach out about AISI funding your project, filling out this form is not an application for funding.)
May 7, 2025 at 5:56 PM
The post sets out:

– Why we're excited about safety cases
– Why we focus (initially) on honesty
– What we mean when we talk about 'asymptotic guarantees'

3/5
May 7, 2025 at 5:56 PM
You can also apply directly for funding via the AISI Challenge Fund:
www.aisi.gov.uk/grants
April 16, 2025 at 4:39 PM
Identifying people to work with is the biggest bottleneck for the UK AISI alignment team right now. Help out by filling in or sharing the form below:
forms.office.com/e/BFbeUeWYQ9
April 16, 2025 at 4:39 PM
We’re particularly excited to hear from:
– ML researchers
– Complexity theorists
– Game theorists
– Cognitive scientists
– People who could build datasets
– People who could run human studies
– Anyone else who thinks they might be doing, or could be doing, relevant work
April 16, 2025 at 4:39 PM
We’re trying to massively scale up the total global effort going into security-relevant alignment research, to prevent superhuman AI from posing critical risk.

We do this by:
1. Identifying key alignment subproblems
2. Identifying people who can solve them
3. Funding research
April 16, 2025 at 4:39 PM