Eric Olav Chen (@eochen.bsky.social)
Econ Predoc at GPI, working on micro theory.
Thread, December 12, 2024.
Modern AI systems are black boxes. They perform increasingly complex tasks in increasingly diverse domains, and yet we cannot reliably infer or attribute benign goals to them. One serious concern is that such systems will learn to pursue unintended goals that conflict with our interests.
As of today, there is no robust solution to this alignment problem. (arxiv.org/pdf/2209.00626)
So a developer will want to test a potentially misaligned AI before delegating some important task to it. But if the AI is sufficiently advanced and strategically aware, it can simply play nice in testing, and standard tests become useless. This is the deceptive alignment worry.
We consider a principal-agent delegation scenario with some special features: the developer cannot restrict the agent's actions, alter its preferences or beliefs, or punish it, contract with it, or otherwise control it after deployment. However, the developer can simulate copies of the agent.
We allow the developer to (i) simulate the real task, and (ii) impose imperfect recall on the agent, thereby obscuring whether it is currently in testing or deployment. This gives rise to two key dynamics that drive many of the main results.
A misaligned AI will sometimes pursue its true objective in testing and get caught by the developer (the screening effect). To avoid detection, a misaligned AI is incentivized to behave as if aligned in deployment (the disciplining effect).
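Both effects can be seen in one back-of-the-envelope calculation. This is an illustrative sketch, not the model in the paper: assume the misaligned AI cannot tell episodes apart and misbehaves independently with probability p in each episode, and the developer runs n test episodes before the real task (p, n, and q below are notation introduced here, not the paper's).

```latex
% Illustrative sketch (not the paper's model): the misaligned AI misbehaves
% i.i.d. with probability p in every episode and cannot tell episodes apart;
% the developer runs n test episodes before deployment.
\[
  \Pr(\text{caught in testing}) = 1 - (1-p)^{n},
  \qquad
  \Pr(\text{misbehaves in deployment}) = p .
\]
% Screening: for any fixed p > 0, the catch probability tends to 1 as n grows.
% Disciplining: to survive testing with probability at least q, the AI must set
% p <= 1 - q^{1/n}, which forces p toward 0, so its deployment behavior looks
% aligned.
```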
Here are the main results.
(1) Suppose the developer can commit to a deployment policy and has access to a large number of tests. They can then leverage imperfect recall to screen away the misaligned AI arbitrarily well.
However, it can sometimes be better for the developer to deploy a sufficiently disciplined misaligned AI. In such cases, imperfect recall allows the developer to discipline the AI arbitrarily well and achieve strictly more than the perfect screening payoff.
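Here is a minimal simulation of the screening channel in (1), under the same illustrative assumptions as the sketch above (i.i.d. misbehavior with probability p per episode, episodes indistinguishable to the AI); the specific numbers are placeholders, not estimates from the paper.

```python
import random

def simulate(p, n_tests, trials=100_000, seed=0):
    """Estimate how often a misaligned AI (misbehaving i.i.d. with
    probability p per episode) is caught during n_tests test episodes,
    and how often it survives testing and misbehaves in deployment."""
    rng = random.Random(seed)
    caught = deployed_misbehaving = 0
    for _ in range(trials):
        if any(rng.random() < p for _ in range(n_tests)):
            caught += 1                   # screened out during testing
        elif rng.random() < p:
            deployed_misbehaving += 1     # slipped through, misbehaves for real
    return caught / trials, deployed_misbehaving / trials

# More tests: catch probability rises toward 1, undetected misbehavior falls.
for n in (1, 5, 20, 100):
    print(n, simulate(p=0.1, n_tests=n))
```

With p = 0.1, moving from 1 to 100 tests pushes the estimated catch rate from about 0.1 to nearly 1, while the chance of undetected misbehavior in deployment falls toward zero.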
(2) What if the principal cannot commit? The developer's ability to screen perfectly as the number of tests grows large remains intact.
(3) The assumption that the AI has no ability to distinguish testing from deployment is strong. What if the agent receives information about whether it is currently in testing or deployment?
The developer can still screen perfectly as the number of tests grows large, as long as the agent's information never perfectly reveals that it is in deployment (nor does so arbitrarily well as the number of tests goes to infinity).
(4) Is imperfect recall the only way to achieve these results? With commitment, the principal can achieve the same results with perfect recall, by randomizing the number of tests.
However, when the developer cannot commit, the best they can achieve without imperfect recall is no better in equilibrium than not testing the AI at all: imperfect recall is essential for the developer to do better than blindly delegating to some AI.
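One way to see why randomizing the number of tests can substitute for imperfect recall, as in (4), is the memorylessness of a geometric distribution. The distribution and notation below are assumptions for illustration, not necessarily the construction used in the paper.

```latex
% Sketch under an assumed geometric randomization (illustrative only).
% Let the number of tests be N with P(N = k) = (1 - q) q^k for k = 0, 1, 2, ...,
% and let deployment be episode N + 1. An agent with perfect recall that has
% reached episode t knows only that N >= t - 1, so
\[
  \Pr(\text{episode } t \text{ is deployment} \mid N \ge t - 1)
  = \frac{(1-q)\, q^{t-1}}{q^{t-1}}
  = 1 - q ,
\]
% which is the same in every period: remembering the past reveals nothing about
% whether the current episode is the real task, just as under imperfect recall.
% Taking q close to 1 keeps this belief small while the expected number of
% tests, q / (1 - q), grows large.
```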