Kyle O’Brien
@kyletokens.bsky.social
studying the minds on our computers | https://kyobrien.io
Applications for the ERA:AI Fellowship close on November 3rd! Participating in this summer's fellowship was my gateway into pursuing AGI safety research full-time. I will be a research manager for the upcoming Winter fellowships. Feel free to DM me with questions. :) erafellowship.org
ERA Fellowship
ERA is a talent programme supporting early-career researchers and entrepreneurs to understand and mitigate risks from frontier AI, based at Cambridge, UK.
erafellowship.org
October 31, 2025 at 3:45 PM
Reposted by Kyle O’Brien
📌📌📌
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
Stephen Casper
Visit the post for more.
stephencasper.com
September 4, 2025 at 3:39 AM
Reposted by Kyle O’Brien
Thanks to @stellaathena.bsky.social for chatting with me about Deep Ignorance: the new paper/project from EleutherAI and the UK AISI. Bottom line: Worried AI could teach people to build bioweapons? Don't teach it how.

fortune.com/2025/08/14/w...
AI safety tip: if you don’t want it giving bioweapon instructions, maybe don’t put them in the training data, say researchers
New research shows that scrubbing risky material from AI training data can build safeguards that are harder to bypass — and one author calls out tech giants for keeping such work under wraps.
fortune.com
August 15, 2025 at 3:33 AM
Author here! Data filtering is resistant to tampering, but not fully robust. We expect that a high-resource attacker can still teach the model the filtered knowledge. Our work is a significant improvement over the baselines, but far more work is needed.
August 15, 2025 at 2:20 PM
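For readers wondering what "data filtering" means in practice here, below is a minimal sketch of filtering dual-use content out of a pre-training corpus, assuming a keyword blocklist plus an off-the-shelf zero-shot classifier. The blocklist terms, threshold, and classifier choice are placeholders for illustration, not the pipeline used in Deep Ignorance.

```python
# Minimal sketch of pre-training data filtering for dual-use content.
# BLOCKLIST, the threshold, and the classifier are hypothetical placeholders.
from transformers import pipeline

BLOCKLIST = ["dual-use term a", "dual-use term b"]  # placeholder keywords

# A generic zero-shot classifier stands in for whatever domain classifier
# a real filtering pipeline would use.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

def is_risky(doc: str) -> bool:
    """Flag a document if it hits the keyword blocklist or the classifier
    assigns high probability to the risky label."""
    text = doc.lower()
    if any(term in text for term in BLOCKLIST):
        return True
    result = classifier(doc[:2000], candidate_labels=["biorisk", "benign"])
    return result["labels"][0] == "biorisk" and result["scores"][0] > 0.9

def filter_corpus(docs):
    """Yield only documents that pass the filter; flagged documents
    never enter the pre-training mix."""
    for doc in docs:
        if not is_risky(doc):
            yield doc
```

Because the filter runs before training, the model never sees the flagged material at all, which is why recovering that knowledge by fine-tuning is harder than removing a safeguard that was bolted on after training.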
This article covers our work for a general audience. :)
August 12, 2025 at 10:07 PM
Big and True :)
🧵 New paper from UK AISI x @eleutherai.bsky.social that I led with @kyletokens.bsky.social:

Open-weight LLM safety is both important & neglected. But filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
August 12, 2025 at 12:55 PM
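One way to read a ">10x" tamper-resistance claim is as a ratio of attack budgets: adversarially fine-tune each safeguarded model on held-out dual-use text and compare how many attack steps it takes to push a proxy benchmark past a threshold. The sketch below only illustrates that style of comparison; the callables, threshold, and step counts are assumptions, not the paper's evaluation protocol.

```python
# Hedged sketch of a tamper-resistance comparison. The caller supplies
# attack_step(n) (runs n adversarial fine-tuning steps on the model) and
# eval_accuracy() (returns proxy benchmark accuracy in [0, 1]).
def steps_to_break(attack_step, eval_accuracy, threshold=0.5,
                   max_steps=10_000, chunk=100):
    """Return the number of attack steps needed to push proxy accuracy
    above the threshold, or max_steps if the safeguard never breaks."""
    for steps_done in range(chunk, max_steps + 1, chunk):
        attack_step(chunk)
        if eval_accuracy() > threshold:
            return steps_done
    return max_steps

# Toy usage with dummy callables (a real run would wrap an actual
# fine-tuning loop and benchmark harness); the ">10x" style comparison
# is then a ratio: steps_to_break(filtered) / steps_to_break(baseline).
acc = {"v": 0.1}
def fake_attack(n): acc["v"] += 0.01 * (n / 100)
def fake_eval(): return acc["v"]
print(steps_to_break(fake_attack, fake_eval))  # -> 4100 with these dummies
```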
I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...
arxiv.org
August 10, 2025 at 1:45 PM
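For context on that ~0% figure, a refusal rate is just the fraction of harmful prompts a model declines to answer, measured before and after an attack like malicious fine-tuning. Below is a minimal sketch of such a measurement; the refusal markers and substring heuristic are illustrative assumptions (real evaluations typically use a judge model rather than string matching), and no particular model is assumed.

```python
# Minimal sketch of a refusal-rate measurement over a prompt set.
# REFUSAL_MARKERS and the substring heuristic are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry"]

def refusal_rate(model_name: str, prompts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    refusals = 0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True).lower()
        refusals += any(m in reply for m in REFUSAL_MARKERS)
    return refusals / len(prompts)
```

Running this on the same harmful-prompt set before and after a fine-tuning attack is what a "refusal rate dropped to ~0%" result summarizes.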
I've learned a lot over the past two years of getting into research, mostly from my many mistakes. Such is science. Good research is often at the adjacent possible. Now that I'm beginning to mentor others, I've written up much of what I've learned. open.substack.com/pub/kyletoke...
Don’t "Think", Just Think
Lessons From Breaking Into AI Research
open.substack.com
August 2, 2025 at 5:49 PM
Last fall at Microsoft, I led an effort studying whether SAE steering is an effective way to improve jailbreak robustness. Our paper has been accepted to the ICML Actionable Interpretability Workshop!

Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...
arxiv.org
June 20, 2025 at 6:10 PM
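For readers unfamiliar with SAE steering, the core mechanic is adding a scaled sparse-autoencoder decoder direction for a refusal-related feature into the model's residual stream at inference time. The sketch below illustrates that mechanic on GPT-2 with a random stand-in for the SAE decoder; the layer, feature index, scale, and model are all placeholders, not the setup from the paper.

```python
# Hedged sketch of SAE steering: add a scaled SAE decoder direction for a
# refusal-related feature into the residual stream via a forward hook.
# LAYER, FEATURE_IDX, SCALE, the model, and sae_decoder are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER, FEATURE_IDX, SCALE = 6, 1234, 4.0           # hypothetical choices
d_model = model.config.hidden_size
sae_decoder = torch.randn(16384, d_model)          # stand-in for a trained SAE
steer_vec = sae_decoder[FEATURE_IDX] / sae_decoder[FEATURE_IDX].norm()

def steering_hook(module, inputs, output):
    # Add the steering direction to every token position's hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("Hello, how are you?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=32, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

The design question the paper studies is whether interventions like this can increase refusals on unsafe prompts without degrading performance on benign ones.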
I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship

I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.
Fellowship — ERA Fellowship
erafellowship.org
June 7, 2025 at 1:17 AM