Fazl Barez
@fbarez.bsky.social
Let's build AIs we can trust!
Pinned
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
🚨 New AI Safety Course @aims_oxford!

I’m thrilled to launch a new course, AI Safety & Alignment (AISAA), on the foundations & frontier research of making advanced AI systems safe and aligned at @UniofOxford.

What to expect 👇
robots.ox.ac.uk/~fazl/aisaa/
October 6, 2025 at 4:40 PM
Reposted by Fazl Barez
Evaluating the Infinite
🧵
My latest paper tries to solve a longstanding problem afflicting fields such as decision theory, economics, and ethics — the problem of infinities.
Let me explain a bit about what causes the problem and how my solution avoids it.
1/N
arxiv.org/abs/2509.19389
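To make the idea concrete, here is a small worked illustration (my own sketch of the general hyperreal approach, not necessarily the paper's exact construction): identify a divergent sum with its sequence of partial sums and read that sequence as a hyperreal number, writing $\omega$ for the hyperreal represented by $(1, 2, 3, \dots)$. Then

$$\sum_{n=1}^{\infty} 1 \;\longmapsto\; [(1, 2, 3, \dots)] = \omega, \qquad \sum_{n=1}^{\infty} n \;\longmapsto\; [(1, 3, 6, 10, \dots)] = \frac{\omega(\omega+1)}{2},$$

so the second sum receives a strictly larger infinite value than the first, the kind of fine-grained comparison that a single symbol $\infty$ cannot express.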
September 25, 2025 at 3:28 PM
🚀 Excited to have 2 papers accepted at #NeurIPS2025! 🎉 Congrats to my amazing co-authors!

More details (and more bragging) soon! And maybe even more news on Sep 25 👀

See you all in… Mexico? San Diego? Copenhagen? Who knows! 🌍✈️
September 19, 2025 at 9:08 AM
Reposted by Fazl Barez
🚨 NEW PAPER 🚨: Embodied AI (incl. AI-powered drones, self-driving cars and robots) is here, but policies are lagging. We analyzed EAI risks and found significant gaps in governance.

arxiv.org/pdf/2509.00117

Co-authors: Jared Perlo, @fbarez.bsky.social, Alex Robey & @floridi.bsky.social

1/4
September 4, 2025 at 5:51 PM
Reposted by Fazl Barez
Other works have highlighted that CoTs ≠ explainability alphaxiv.org/abs/2025.02 (@fbarez.bsky.social), and that intermediate (CoT) tokens ≠ reasoning traces arxiv.org/abs/2504.09762 (@rao2z.bsky.social).

Here, FUR offers a fine-grained test of whether LMs latently used information from their CoTs to produce answers!
August 21, 2025 at 3:21 PM
Reposted by Fazl Barez
It is easy to confuse chain of thought with explainability; in fact, much of the media presents current LLMs as if we can view their actual thought processes. That is not the case!
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
July 2, 2025 at 12:41 PM
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
July 1, 2025 at 3:41 PM
Technology = power. AI is reshaping power — fast.

Today’s AI doesn’t just assist decisions; it makes them. Governments use it for surveillance, prediction, and control — often with no oversight.

Technical safeguards aren’t enough on their own — but they’re essential for AI to serve society.
June 27, 2025 at 8:07 AM
Reposted by Fazl Barez
And Anna Yelizarov, @fbarez.bsky.social, @scasper.bsky.social, Beatrice Erkers, among others.

We'll draw from political theory, cooperative AI, economics, mechanism design, history, and hierarchical agency.
June 18, 2025 at 6:12 PM
Reposted by Fazl Barez
This is a step toward targeted, interpretable, and robust knowledge removal — at the parameter level.

Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
May 29, 2025 at 4:22 PM
Come work with me at Oxford this summer! Paid research opportunity in:

- White-box LLMs & model security
- Safe RL & reward hacking
- Interpretability & governance tools

Remote or Oxford.

Apply by 30 May 23:59 UTC. DM with questions.
May 20, 2025 at 5:13 PM
Come work with me at Oxford!

We’re hiring a Postdoc in Causal Systems Modelling to:

- Build causal & white-box models that make frontier AI safer and more transparent
- Turn technical insights into safety cases, policy briefs, and governance tools

DM if you have any questions.
May 15, 2025 at 11:12 AM
First-time Area Chair seeking advice! What helped you most when evaluating papers beyond just averaging scores?

After suffering through unhelpful reviews as an author, I want to do right by papers in my track.
April 8, 2025 at 11:59 AM
Reposted by Fazl Barez
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!
March 31, 2025 at 4:59 PM
Technical AI Governance (TAIG) at #ICML2025 this July in Vancouver!

Credit to Ben and Lisa for all the work!

We have a new centre at Oxford working on technical AI governance with Robert Trager, @maosbot.bsky.social, and many other great minds. We are hiring - please reach out!
Quote
📣We’re thrilled to announce the first workshop on Technical AI Governance (TAIG) at #ICML2025 this July in Vancouver! Join us (& this stellar list of speakers) in bringing together technical & policy experts to shape the future of AI governance! www.taig-icml.com
April 1, 2025 at 3:10 PM
Reposted by Fazl Barez
Life update: I'm starting as faculty at Boston University
@bucds.bsky.social in 2026! BU has SCHEMES for LM interpretability & analysis; I couldn't be more pumped to join a burgeoning supergroup w/ @najoung.bsky.social and @amuuueller.bsky.social. Looking for my first students, so apply and reach out!
March 27, 2025 at 2:24 AM
Reposted by Fazl Barez
New paper alert!

Curious how small prompt tweaks impact LLM accuracy but don’t want to run endless inferences? We got you. Meet DOVE - a dataset built to uncover these sensitivities.

Use DOVE for your analysis or contribute samples - we're growing and we'd welcome you aboard!
Care about LLM evaluation? 🤖 🤔

We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
on different prompts, domains, tokens, models...

Join our community effort to expand it with YOUR model predictions & become a co-author!
March 17, 2025 at 4:33 PM
Reposted by Fazl Barez
What happens once AI can design better AI, which can itself design better AI? Will we get an "intelligence explosion" where AI capabilities increase very rapidly? Tom Davidson, Rose Hadshar and I have a new paper out with analysis of these dynamics.
March 17, 2025 at 2:54 PM
Reposted by Fazl Barez
My group @FLAIR_Ox is recruiting a postdoc and looking for someone who can get started by the end of April. Deadline to apply is in one week (!), 19th of March at noon, so please help spread the word: my.corehr.com/pls/uoxrecru...
March 12, 2025 at 3:17 PM
Reposted by Fazl Barez
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
March 6, 2025 at 10:15 PM
🔍 Excited to share our paper: "Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness"!
March 4, 2025 at 5:24 PM
New paper alert! 🚨

Important question: Do SAEs generalise?
We explore answerability detection in LLMs by comparing SAE features vs. linear residual stream probes.

Answer:
Probes outperform SAE features in-domain, while out-of-domain generalization varies sharply across features and datasets. 🧵
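To make the comparison concrete, here is a minimal, self-contained sketch (my own illustration with random stand-in data and a toy SAE encoder, not the paper's code): fit a logistic-regression probe directly on residual-stream activations, fit the same probe on SAE feature activations, and compare held-out accuracy.

```python
# Sketch only: random arrays stand in for real residual-stream activations,
# and the "SAE" here is an untrained ReLU encoder used purely to show the flow.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
d_model, d_sae = 512, 2048

# Stand-ins for final-token activations and answerability labels (0/1).
X_train, y_train = rng.normal(size=(1000, d_model)), rng.integers(0, 2, 1000)
X_test, y_test = rng.normal(size=(200, d_model)), rng.integers(0, 2, 200)

# Hypothetical (untrained) SAE encoder: sparse-ish features via ReLU(x W + b).
W_enc = rng.normal(size=(d_model, d_sae)) * 0.02
b_enc = np.zeros(d_sae)
encode = lambda x: np.maximum(x @ W_enc + b_enc, 0.0)

# 1) Linear probe directly on the residual stream.
resid_probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("residual-stream probe acc:",
      accuracy_score(y_test, resid_probe.predict(X_test)))

# 2) The same probe on SAE feature activations.
sae_probe = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)
print("SAE-feature probe acc:",
      accuracy_score(y_test, sae_probe.predict(encode(X_test))))
```

For the out-of-domain comparison, you would keep the fitted probes and replace the test split with activations drawn from a different dataset.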
March 1, 2025 at 6:14 PM
Reposted by Fazl Barez
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Fazl Barez
We are excited to welcome Fazl Barez @fbarez.bsky.social, who joins us as a senior postdoctoral research fellow. He will be leading research initiatives in AI safety and interpretability.
@oxmartinschool.bsky.social

Find out more: www.oxfordmartin.ox.ac.uk/people/fazl-...
February 18, 2025 at 3:37 PM