Pinned
Fazl Barez
@fbarez.bsky.social
· Jul 1
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
🚨 New AI Safety Course @aims_oxford!
I’m thrilled to launch a new course called AI Safety & Alignment (AISAA) on the foundations & frontier research of making advanced AI systems safe and aligned at @UniofOxford.
What to expect 👇
robots.ox.ac.uk/~fazl/aisaa/
October 6, 2025 at 4:40 PM
Reposted by Fazl Barez
Evaluating the Infinite
🧵
My latest paper tries to solve a longstanding problem afflicting fields such as decision theory, economics, and ethics — the problem of infinities.
Let me explain a bit about what causes the problem and how my solution avoids it.
1/N
arxiv.org/abs/2509.19389
Evaluating the Infinite
I present a novel mathematical technique for dealing with the infinities arising from divergent sums and integrals. It assigns them fine-grained infinite values from the set of hyperreal numbers in a ...
arxiv.org
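For intuition, here is a minimal sketch I am adding (not notation taken from the paper) of what assigning fine-grained hyperreal values to divergent sums can look like under the standard ultrapower picture, where a hyperreal is an equivalence class of real sequences and a divergent series is sent to the class of its partial sums:

% Illustration only; \omega here denotes the hyperreal represented by the sequence (1, 2, 3, ...).
\[
  \sum_{n=1}^{\infty} 1 \;\longmapsto\; [(1, 2, 3, \dots)] = \omega,
  \qquad
  \sum_{n=1}^{\infty} n \;\longmapsto\; \Big[\Big(\tfrac{n(n+1)}{2}\Big)_{n \ge 1}\Big] = \tfrac{\omega(\omega+1)}{2}.
\]

Under this picture, different divergent sums receive different, comparable infinite values rather than a single undifferentiated ∞, which is the "fine-grained" property the abstract highlights.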
September 25, 2025 at 3:28 PM
🚀 Excited to have 2 papers accepted at #NeurIPS2025! 🎉 Congrats to my amazing co-authors!
More details (and more bragging) soon! And maybe even more news on Sep 25 👀
See you all in… Mexico? San Diego? Copenhagen? Who knows! 🌍✈️
September 19, 2025 at 9:08 AM
Reposted by Fazl Barez
🚨 NEW PAPER 🚨: Embodied AI (incl. AI-powered drones, self-driving cars, and robots) is here, but policies are lagging. We analyzed the EAI risks and found significant gaps in governance.
arxiv.org/pdf/2509.00117
Co-authors: Jared Perlo, @fbarez.bsky.social, Alex Robey & @floridi.bsky.social
1/4
September 4, 2025 at 5:51 PM
Reposted by Fazl Barez
Other works have highlighted that CoTs ≠ explainability alphaxiv.org/abs/2025.02 (@fbarez.bsky.social), and that intermediate (CoT) tokens ≠ reasoning traces arxiv.org/abs/2504.09762 (@rao2z.bsky.social).
Here, FUR offers a fine-grained test of whether LMs latently used information from their CoTs to produce answers!
Chain-of-Thought Is Not Explainability | alphaXiv
alphaxiv.org
August 21, 2025 at 3:21 PM
Reposted by Fazl Barez
It is so easy to confuse chain of thought with explainability; in fact, a lot of media coverage presents current LLMs as if we can view their actual thought processes. It is not that!
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their steps (CoT) aren't necessarily revealing their true reasoning. Spoiler: the transparency can be an illusion. (1/9) 🧵
July 2, 2025 at 12:41 PM
Technology = power. AI is reshaping power — fast.
Today’s AI doesn’t just assist decisions; it makes them. Governments use it for surveillance, prediction, and control — often with no oversight.
Technical safeguards aren’t enough on their own — but they’re essential for AI to serve society.
June 27, 2025 at 8:07 AM
Reposted by Fazl Barez
And Anna Yelizarov, @fbarez.bsky.social, @scasper.bsky.social, Beatrice Erkers, among others.
We'll draw from political theory, cooperative AI, economics, mechanism design, history, and hierarchical agency.
June 18, 2025 at 6:12 PM
Reposted by Fazl Barez
This is a step toward targeted, interpretable, and robust knowledge removal — at the parameter level.
Joint work with Clara Suslik, Yihuai Hong, and @fbarez.bsky.social, advised by @megamor2.bsky.social
🔗 Paper: arxiv.org/abs/2505.22586
🔗 Code: github.com/yoavgur/PISCES
May 29, 2025 at 4:22 PM
Come work with me at Oxford this summer! Paid research opportunity covering:
- White-box LLMs & model security
- Safe RL & reward hacking
- Interpretability & governance tools
Remote or Oxford.
Apply by 30 May 23:59 UTC. DM with questions.
May 20, 2025 at 5:13 PM
Come work with me at Oxford!
We’re hiring a Postdoc in Causal Systems Modelling to:
- Build causal & white-box models that make frontier AI safer and more transparent
- Turn technical insights into safety cases, policy briefs, and governance tools
DM if you have any questions.
May 15, 2025 at 11:12 AM
First-time Area Chair seeking advice! What helped you most when evaluating papers beyond just averaging scores?
After suffering through unhelpful reviews as an author, I want to do right by papers in my track.
April 8, 2025 at 11:59 AM
Reposted by Fazl Barez
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io
@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social
Paper submission deadline: May 9th!
March 31, 2025 at 4:59 PM
Reposted by Fazl Barez
Organizers: Ben Bucknall, @lisasoder.bsky.social, @ankareuel.bsky.social, @fbarez.bsky.social, @carlosmougan.bsky.social
Weiwei Pan, Siddharth Swaroop, @ankareuel.bsky.social, Robert Trager, @maosbot.bsky.social
April 1, 2025 at 2:58 PM
Technical AI Governance (TAIG) at #ICML2025 this July in Vancouver!
Credit to Ben and Lisa for all the work!
We have a new centre at Oxford working on technical AI governance with Robert Trager, @maosbot.bsky.social, and many other great minds. We are hiring - please reach out!
📣We’re thrilled to announce the first workshop on Technical AI Governance (TAIG) at #ICML2025 this July in Vancouver! Join us (& this stellar list of speakers) in bringing together technical & policy experts to shape the future of AI governance! www.taig-icml.com
April 1, 2025 at 3:10 PM
Reposted by Fazl Barez
Life update: I'm starting as faculty at Boston University
@bucds.bsky.social in 2026! BU has SCHEMES for LM interpretability & analysis; I couldn't be more pumped to join a burgeoning supergroup w/ @najoung.bsky.social @amuuueller.bsky.social. Looking for my first students, so apply and reach out!
March 27, 2025 at 2:24 AM
Reposted by Fazl Barez
New paper alert!
Curious how small prompt tweaks impact LLM accuracy but don’t want to run endless inferences? We got you. Meet DOVE - a dataset built to uncover these sensitivities.
Use DOVE for your analysis or contribute samples - we're growing and welcome you aboard!
Care about LLM evaluation? 🤖 🤔
We bring you 🕊️ DOVE, a massive (250M!) collection of LLM outputs
On different prompts, domains, tokens, models...
Join our community effort to expand it with YOUR model predictions & become a co-author!
March 17, 2025 at 4:33 PM
Reposted by Fazl Barez
What happens once AI can design better AI, which can itself design better AI? Will we get an "intelligence explosion" where AI capabilities increase very rapidly? Tom Davidson, Rose Hadshar and I have a new paper out with analysis of these dynamics.
March 17, 2025 at 2:54 PM
Reposted by Fazl Barez
My group @FLAIR_Ox is recruiting a postdoc and looking for someone who can get started by the end of April. Deadline to apply is in one week (!), 19th of March at noon, so please help spread the word: my.corehr.com/pls/uoxrecru...
Job Details
my.corehr.com
March 12, 2025 at 3:17 PM
Reposted by Fazl Barez
1/13 LLM circuits tell us where the computation happens inside the model—but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇
March 6, 2025 at 10:15 PM
🔍 Excited to share our paper: "Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness"!
March 4, 2025 at 5:24 PM
New paper alert! 🚨
Important question: Do SAEs generalise?
We explore answerability detection in LLMs by comparing SAE features vs. linear residual stream probes.
Answer: probes outperform SAE features in-domain; out-of-domain generalization varies sharply between features and datasets. 🧵
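To make the linear-probe side of the comparison concrete, here is a minimal runnable sketch (illustrative only, not the paper's code); the array shapes, the labels, and the layer the activations come from are placeholder assumptions.

# Minimal sketch (illustrative, not the paper's code) of a linear residual-stream probe
# for answerability detection. Assumes activations were already extracted from some layer
# of an LLM; random placeholders are used here so the snippet runs on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 4096))      # placeholder residual-stream activations
labels = rng.integers(0, 2, size=1000)    # placeholder answerable (1) / unanswerable (0) labels

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # the linear probe
print("in-domain probe accuracy:", probe.score(X_test, y_test))

# Out-of-domain generalisation would be measured by scoring this same probe on activations
# from a different dataset; the SAE baseline would replace `acts` with SAE feature
# activations. Those are the two comparisons the post describes.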
March 1, 2025 at 6:14 PM
Reposted by Fazl Barez
🚨New arXiv preprint!🚨
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
February 19, 2025 at 3:50 PM
Reposted by Fazl Barez
We are excited to welcome Fazl Barez @fbarez.bsky.social, who joins us as a senior postdoctoral research fellow. He will be leading research initiatives in AI safety and interpretability.
@oxmartinschool.bsky.social
Find out more: www.oxfordmartin.ox.ac.uk/people/fazl-...
February 18, 2025 at 3:37 PM