Lightnews — Scholar-powered news

Geoffrey Irving

@girving.bsky.social

The UK AI Security Institute ran an Alignment Conference from 29-31 October in London! The goal was to gather a mix of people experienced in and new to alignment, and get into the details of novel approaches to alignment and related problems. Hopefully we helped create some new research bets! 🧵

November 13, 2025 at 5:00 PM

Geoffrey Irving

@girving.bsky.social

Another strong transition from @matt-levine.bsky.social.

October 23, 2025 at 7:59 PM

Geoffrey Irving

@girving.bsky.social

New open source library from the UK AI Security Institute! ControlArena lowers the barrier to secure and reproducible AI control research, to boost work on blocking and detecting malicious actions in case AI models are misaligned. In use by researchers at GDM, Anthropic, Redwood, and MATS! 🧵

October 22, 2025 at 6:04 PM

Geoffrey Irving

@girving.bsky.social

Ominous start to a Wikipedia page about a formula...

en.wikipedia.org/wiki/Fa%C3%A...

September 29, 2025 at 9:02 PM

Geoffrey Irving

@girving.bsky.social

From near the end of Sleepwalkers, by Christopher Clark, as World War I starts.

An English traveller recalled the reaction in an Altai (Semipalatinsk) Cossack settlement when the 'blue flag' borne aloft by a rider and the noise of bugles playing the alarm brought news of mobilization. The Tsar had spoken, and the Cossacks, with their unique military calling and tradition, 'burned to fight the enemy'. But who was that enemy? Nobody knew. The mobilization telegram provided no details. Rumours abounded. At first everyone imagined that the war must be with China - 'Russia had pushed too far into Mongolia and China had declared war.' Then another rumour did the rounds: 'It is with England, with England.' This view prevailed for some time.

> Only after four days did something like the truth come to us, and then nobody believed it.

August 23, 2025 at 3:40 PM

Geoffrey Irving

@girving.bsky.social

Short note on relativisation in debate protocols: to model AI training protocols, we need results that hold even if our source of truth (humans for instance) is a black box that can't be introspected. With @benjamin-hilton.bsky.social and Simon Marshall. 🧵

www.alignmentforum.org/posts/XycoFu...

June 26, 2025 at 4:46 PM

Geoffrey Irving

@girving.bsky.social

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, ..., y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B’s probabilities, B wins if neither player can locate a flaw.

June 17, 2025 at 4:52 PM

Geoffrey Irving

@girving.bsky.social

Going back through old blog posts, and I still love these old cloth collision event visualizations.

naml.us/post/visualizi…

May 31, 2025 at 2:18 PM

Geoffrey Irving

@girving.bsky.social

AISI's research agenda is out! We cover a variety of topics in the evaluation and mitigation of risks from frontier LLMs, including both work happening at AISI and work we are excited to see others tackle.

www.aisi.gov.uk/research-age...

Breakdown of LLM risk domain areas (cyber misuse, dual-use science risks, criminal misuse, autonomous systems risks, societal resilience risks, and human influence risks) and research components (understand risks, build and run evaluations, research and assess mitigations).

May 6, 2025 at 10:55 AM

Geoffrey Irving

@girving.bsky.social

A new gem I just discovered: how to paste an image on top of a pdf in Preview. :)

apple.stackexchange.com/questions/37...

May 6, 2025 at 9:11 AM

Geoffrey Irving

@girving.bsky.social

Such a difference could be super subtle. Models seem to be able to make impressive inferences from just texture, such as in this image Scott Alexander tried: astralcodexten.com/p/testing-ai...

Somewhat blurry image of a river from Scott Alexander, which o3 is also able to guess (but less precisely).

May 2, 2025 at 12:02 PM

Geoffrey Irving

@girving.bsky.social

The LLM geoguesser discussions remind me of the trapdoor technique Jonah Brown-Cohen and I were tinkering with for alignment purposes. Say you want to know whether a model can do a task. How could you know this, without being able to verify individual answers? 🧵

x.com/KelseyTuoc/s...

Kelsey Piper's original image of a kid flying a kite on a beach, for which o3 is able to guess the specific beach.

May 2, 2025 at 12:02 PM

Geoffrey Irving

@girving.bsky.social

Reading my electricity meter seems to have held up as a hard benchmark for VLMs. Here’s o3 thinking for ~6 minutes and arriving at the wrong answer. (o4-mini also gets it wrong.)

April 27, 2025 at 9:46 AM

Geoffrey Irving

@girving.bsky.social

Please apply if interested!

t.co/AqlwmxvVdH

April 16, 2025 at 4:44 PM

Geoffrey Irving

@girving.bsky.social

The most beautiful equation in mathematics.

April 16, 2025 at 4:17 AM

Geoffrey Irving

@girving.bsky.social

February 12, 2025 at 9:27 PM

Geoffrey Irving

@girving.bsky.social

Once we're in the two element free-group, the paradoxical decomposition can be visualised directly: there's nothing fundamentally infinite about it. Wikipedia has a nice picture and discussion.

en.wikipedia.org/wiki/Banach%...
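
A rough sketch of the finite picture, not taken from the linked page: if S(g) is the set of reduced words starting with g, then a·S(a^-1) is exactly the set of words that do not start with a, so S(a) and a·S(a^-1) together cover the whole group (and likewise for b). The Python below checks this on balls of bounded word length. The names `reduced_words`, `left_mul`, `starting_with`, and `check_paradox` are made up for illustration, and "A", "B" are an assumed encoding of the inverses of the generators "a", "b".

```python
# Letters: "a", "b" are the generators; "A", "B" stand for their inverses.
INV = {"a": "A", "A": "a", "b": "B", "B": "b"}

def reduced_words(max_len):
    """All reduced words of length <= max_len, including the empty word."""
    words = {""}
    frontier = [""]
    for _ in range(max_len):
        # Prepend a letter unless it would cancel with the current first letter.
        frontier = [g + w for w in frontier for g in "aAbB"
                    if not w or w[0] != INV[g]]
        words.update(frontier)
    return words

def left_mul(g, w):
    """Multiply the reduced word w on the left by the letter g, then reduce."""
    return w[1:] if w and w[0] == INV[g] else g + w

def starting_with(words, g):
    """The subset S(g) of words whose first letter is g."""
    return {w for w in words if w.startswith(g)}

def check_paradox(n):
    """Check S(g) and g*S(g^-1) partition the ball of radius n, for g in {a, b}."""
    ball = reduced_words(n)
    bigger = reduced_words(n + 1)
    for g in "ab":
        # Shifting S(g^-1) by g strips the leading g^-1 from every word.
        shifted = {left_mul(g, w) for w in starting_with(bigger, INV[g])}
        assert shifted == ball - starting_with(ball, g)
        assert starting_with(ball, g) | shifted == ball
    return len(ball)

print(check_paradox(4))
```

The only bookkeeping is that the shifted set has to be taken from a ball one step larger, since multiplying by a shortens every word in S(a^-1) by one letter. Everything else is finite set comparison.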

December 7, 2024 at 12:49 PM

Geoffrey Irving

@girving.bsky.social

Soon:

December 2, 2024 at 10:54 PM

Geoffrey Irving

@girving.bsky.social

It gets worse: in 2020 it was a *four* word phrase.

en.m.wikipedia.org/wiki/Word_of...

December 2, 2024 at 8:01 PM

Geoffrey Irving

@girving.bsky.social

November 28, 2024 at 7:36 PM

Geoffrey Irving

@girving.bsky.social

Böttcher coordinates map from outside the Mandelbrot set to outside the disk. You can make an animation by zooming, since the coordinates are the identity near infinity.

I made it before I knew the math well, and it's much more satisfying to watch now that it's formalised in github.com/girving/ray.
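
For anyone who wants to poke at this numerically, here is a minimal sketch that is not taken from the ray repository. It assumes the standard limit formula, where the Böttcher coordinate of c is the limit of (f_c^n(c))^(1/2^n) with f_c(z) = z^2 + c, and it computes only the modulus to sidestep branch choices for the roots. The function name `bottcher_modulus` and its parameters are hypothetical.

```python
import math

def bottcher_modulus(c, max_iter=200, escape=1e100):
    """Approximate |phi(c)| for c outside the Mandelbrot set.

    Iterates z -> z^2 + c from z = c and uses |phi(c)| ~ |z_n|^(1 / 2^n)
    once |z_n| is huge. Assumes c escapes; raises ValueError otherwise.
    """
    z = complex(c)
    n = 0
    while abs(z) < escape and n < max_iter:
        z = z * z + c
        n += 1
    if abs(z) < escape:
        raise ValueError("orbit did not escape; c may be in the Mandelbrot set")
    return math.exp(math.log(abs(z)) / 2 ** n)

# Far from the set the map is close to the identity: the ratio tends to 1.
for c in (10, 100, 1000, 1000j):
    print(c, bottcher_modulus(c) / abs(c))
```

The value is greater than 1 for every escaping c, matching "outside the set maps to outside the disk", and the printed ratios approach 1 as |c| grows, which is the sense in which the coordinates are the identity near infinity.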