Eric Olav Chen (@eochen.bsky.social)
Econ Predoc at GPI, working on micro theory.
Thread, December 12, 2024.
Modern AI systems are black boxes. They perform increasingly complex tasks in increasingly diverse domains, and yet we cannot reliably infer or attribute benign goals to them. One serious concern is that such systems will learn to pursue unintended goals that conflict with our interests.
As of today, there is no robust solution to this alignment problem. (arxiv.org/pdf/2209.00626)
So a developer will want to test a potentially misaligned AI before delegating some important task to it. But if the AI is sufficiently advanced and strategically aware, it can simply play nice in testing, and standard tests become useless. This is the deceptive alignment worry.
We consider a principal-agent delegation scenario with some special features: the developer cannot restrict the agent's actions, alter its preferences or beliefs, or punish it, contract with it, or otherwise control it after deployment. However, the developer can simulate copies of the agent.
We allow the developer to (i) simulate the real task, and (ii) impose imperfect recall on the agent, thereby obscuring whether it is currently in testing or deployment. This gives rise to two key dynamics that drive many of the main results.
A misaligned AI will sometimes pursue its true objective in testing and get caught by the developer (the screening effect). To avoid detection, a misaligned AI is incentivized to behave as if aligned in deployment (the disciplining effect).
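Both effects can be seen in one back-of-the-envelope calculation. This is an illustrative sketch, not the model in the paper: assume the misaligned AI cannot tell episodes apart and misbehaves independently with probability p in each episode, and the developer runs n test episodes before the real task (p, n, and q below are notation introduced here, not the paper's).

```latex
% Illustrative sketch (not the paper's model): the misaligned AI misbehaves
% i.i.d. with probability p in every episode and cannot tell episodes apart;
% the developer runs n test episodes before deployment.
\[
  \Pr(\text{caught in testing}) = 1 - (1-p)^{n},
  \qquad
  \Pr(\text{misbehaves in deployment}) = p .
\]
% Screening: for any fixed p > 0, the catch probability tends to 1 as n grows.
% Disciplining: to survive testing with probability at least q, the AI must set
% p <= 1 - q^{1/n}, which forces p toward 0, so its deployment behavior looks
% aligned.
```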
Here are the main results.
(1) Suppose the developer can commit to a deployment policy and has access to a large number of tests. They can then leverage imperfect recall to screen away the misaligned AI arbitrarily well.
However, it can sometimes be better for the developer to deploy a sufficiently disciplined misaligned AI. In such cases, imperfect recall allows the developer to discipline the AI arbitrarily well and achieve strictly more than the perfect screening payoff.
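Here is a minimal simulation of the screening channel in (1), under the same illustrative assumptions as the sketch above (i.i.d. misbehavior with probability p per episode, episodes indistinguishable to the AI); the specific numbers are placeholders, not estimates from the paper.

```python
import random

def simulate(p, n_tests, trials=100_000, seed=0):
    """Estimate how often a misaligned AI (misbehaving i.i.d. with
    probability p per episode) is caught during n_tests test episodes,
    and how often it survives testing and misbehaves in deployment."""
    rng = random.Random(seed)
    caught = deployed_misbehaving = 0
    for _ in range(trials):
        if any(rng.random() < p for _ in range(n_tests)):
            caught += 1                   # screened out during testing
        elif rng.random() < p:
            deployed_misbehaving += 1     # slipped through, misbehaves for real
    return caught / trials, deployed_misbehaving / trials

# More tests: catch probability rises toward 1, undetected misbehavior falls.
for n in (1, 5, 20, 100):
    print(n, simulate(p=0.1, n_tests=n))
```

With p = 0.1, moving from 1 to 100 tests pushes the estimated catch rate from about 0.1 to nearly 1, while the chance of undetected misbehavior in deployment falls toward zero.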
(2) What if the principal cannot commit? The developer's ability to screen perfectly as the number of tests grows large remains intact.
(3) The assumption that the AI has no ability to distinguish testing from deployment is strong. What if the agent receives information about whether it is currently in testing or deployment?
The developer can still screen perfectly as the number of tests grows large, as long as the agent's information never perfectly reveals that it is in deployment (nor does so arbitrarily well as the number of tests goes to infinity).
(4) Is imperfect recall the only way to achieve these results? With commitment, the principal can achieve the same results with perfect recall, by randomizing the number of tests.
However, when the developer cannot commit, the best they can achieve without imperfect recall is no better in equilibrium than not testing the AI at all: imperfect recall is essential for the developer to do better than blindly delegating to some AI.
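One way to see why randomizing the number of tests can substitute for imperfect recall, as in (4), is the memorylessness of a geometric distribution. The distribution and notation below are assumptions for illustration, not necessarily the construction used in the paper.

```latex
% Sketch under an assumed geometric randomization (illustrative only).
% Let the number of tests be N with P(N = k) = (1 - q) q^k for k = 0, 1, 2, ...,
% and let deployment be episode N + 1. An agent with perfect recall that has
% reached episode t knows only that N >= t - 1, so
\[
  \Pr(\text{episode } t \text{ is deployment} \mid N \ge t - 1)
  = \frac{(1-q)\, q^{t-1}}{q^{t-1}}
  = 1 - q ,
\]
% which is the same in every period: remembering the past reveals nothing about
% whether the current episode is the real task, just as under imperfect recall.
% Taking q close to 1 keeps this belief small while the expected number of
% tests, q / (1 - q), grows large.
```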