Dr Francis Rhys Ward
@f-rhys-ward.bsky.social
AGI Alignment Researcher
Thanks to my esteemed collaborators Jack Foxabbott and Rohan Subramani!

And thanks to @tom4everitt.bsky.social, Joe Halpern, James Fox, Jonathan Richens, Matt MacDermott, Ryan Carey, Paul Rapoport, and @korbi01.bsky.social for invaluable feedback and discussion! :)
March 16, 2025 at 4:44 PM
Our paper was accepted to AAMAS 25 and you can find it here:
https://arxiv.org/abs/2503.06323
March 16, 2025 at 4:44 PM
In our theory, agents may have different subjective models of the world, but these subjective beliefs can be constrained by objective reality (cf. Tom and Jon below). I've found this useful for thinking about ELK and hope that future work can lead to solution proposals.
March 16, 2025 at 4:44 PM
ELK requires describing how a human can provide a training incentive, in objective reality, that elicits an AI's subjective states, even if these two agents have different conceptual models of reality (a.k.a. "ontology mismatch") or incorrect beliefs about each other's models.
March 16, 2025 at 4:44 PM
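As a rough illustration of why this is hard, here is a toy sketch of my own in Python, loosely based on the "diamond in the vault" example from the linked ELK post (names like obj_17_flags are invented): the training signal is grounded in what a human can check in objective reality, so two reporters with very different relationships to the AI's subjective state can look identical during training.

```python
# Toy illustration (mine, not the paper's or the ELK report's formalism):
# the training signal lives in objective reality, but two very different
# reporters can score identically on the labelled data.

from dataclasses import dataclass

@dataclass
class Episode:
    diamond_present: bool        # ground truth (objective reality)
    camera_shows_diamond: bool   # what the human can actually check
    latent: dict                 # the AI's internal state, in its own ontology

# Hypothetical episodes; in the last one the camera has been fooled.
episodes = [
    Episode(True,  True,  {"obj_17_flags": 0b101}),
    Episode(False, False, {"obj_17_flags": 0b000}),
    Episode(False, True,  {"obj_17_flags": 0b001}),  # tampered sensor
]

def direct_translator(latent: dict) -> bool:
    """Reports what the AI 'knows' (here: bit 2 of its internal flags)."""
    return bool(latent["obj_17_flags"] & 0b100)

def human_simulator(latent: dict) -> bool:
    """Reports what a human looking at the camera would conclude."""
    return bool(latent["obj_17_flags"] & 0b001)

def training_reward(reporter, labelled) -> int:
    """Objective training signal: agreement on episodes the human can label."""
    return sum(reporter(ep.latent) == ep.camera_shows_diamond for ep in labelled)

# The human can only label episodes via the camera, so both reporters
# look equally good on the training distribution...
labelled = episodes[:2]
print(training_reward(direct_translator, labelled))  # 2
print(training_reward(human_simulator, labelled))    # 2
# ...yet they disagree on the tampered episode, which is the ELK problem.
print(direct_translator(episodes[2].latent), human_simulator(episodes[2].latent))
```

Here the AI's latent state uses its own encoding (the ontology mismatch), and the objectively available reward cannot separate a direct translator from a human simulator, which is roughly why ELK is hard.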
We hope that our theory can be used to formalise eliciting latent knowledge (ELK): the problem of designing a training regime that gets an AI system to report what it "knows".
https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc
March 16, 2025 at 4:44 PM
For instance, @tom4everitt.bsky.social and Jonathan Richens show that an agent which is robust to distributional shifts must have internalised a causal model of the world, i.e., its subjective beliefs must capture the causal information in the training environment.
https://arxiv.org/abs/2402.10877
March 16, 2025 at 4:44 PM
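A tiny sketch of the intuition behind that kind of result (my own toy, not the construction in the linked paper): a policy that leans on a spurious correlate of the true cause does as well as a causal policy on the training distribution, but falls apart once the correlation shifts, so only the policy tracking the causal structure is robust.

```python
# Toy illustration (mine, not the linked paper's construction): a policy that
# relies on a spurious correlate matches the causal policy in training, but
# fails under a distributional shift that breaks the correlation.

import random

def sample_env(p_corr: float):
    """'cause' drives the reward; 'signal' merely correlates with the cause."""
    cause = random.random() < 0.5
    signal = cause if random.random() < p_corr else not cause
    return cause, signal

def reward(action: bool, cause: bool) -> int:
    return 1 if action == cause else 0

def causal_policy(cause: bool, signal: bool) -> bool:
    return cause    # uses the true cause of reward

def correlational_policy(cause: bool, signal: bool) -> bool:
    return signal   # uses the spurious correlate

def average_reward(policy, p_corr: float, n: int = 10_000) -> float:
    total = 0
    for _ in range(n):
        cause, signal = sample_env(p_corr)
        total += reward(policy(cause, signal), cause)
    return total / n

random.seed(0)
# Training: signal and cause are almost perfectly correlated.
print(average_reward(causal_policy, p_corr=0.99))         # ~1.00
print(average_reward(correlational_policy, p_corr=0.99))  # ~0.99
# Shifted deployment: the correlation is mostly reversed.
print(average_reward(causal_policy, p_corr=0.10))         # ~1.00
print(average_reward(correlational_policy, p_corr=0.10))  # ~0.10
```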
Is this kind of theory useful? Many foundational challenges for building safe agents rely on understanding an agent’s subjective beliefs, and how these depend on the objective world (e.g., on the training environment).
March 16, 2025 at 4:44 PM
Causal models can represent agents, deception, and generalisation. We extend causal models (really: multi-agent influence models) to settings of incomplete information. This lets us formally reason about strategic interactions between agents with different subjective beliefs.
March 16, 2025 at 4:44 PM
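To make "different subjective beliefs, one objective world" concrete, here is a minimal single-agent-per-line sketch in Python (my own toy, not the multi-agent influence model formalism from the paper): each agent best-responds under its own subjective model, and the objective model determines how well that actually goes.

```python
# Minimal sketch of the subjective-vs-objective distinction (my own toy,
# not the paper's formalism): agents optimise under their *subjective*
# models, but outcomes are evaluated under the *objective* model.

OBJECTIVE_P_HEADS = 0.7                          # the world as it actually is
subjective_p_heads = {"alice": 0.7, "bob": 0.2}  # each agent's own beliefs

ACTIONS = ["bet_heads", "bet_tails"]

def expected_payoff(action: str, p_heads: float) -> float:
    """Win +1 / lose -1 bet on a coin, evaluated under a given belief."""
    win_prob = p_heads if action == "bet_heads" else 1.0 - p_heads
    return win_prob - (1.0 - win_prob)

def subjective_best_response(agent: str) -> str:
    """Each agent optimises against its own subjective model of the coin."""
    return max(ACTIONS, key=lambda a: expected_payoff(a, subjective_p_heads[agent]))

for agent in subjective_p_heads:
    choice = subjective_best_response(agent)
    true_value = expected_payoff(choice, OBJECTIVE_P_HEADS)  # objective reality
    print(f"{agent}: {choice}, objective expected payoff = {true_value:+.2f}")
```

The strategic version, where each agent also holds (possibly incorrect) beliefs about the other agents' models, is the kind of structure the incomplete-information extension described above is meant to capture.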