Kyle O’Brien
@kyletokens.bsky.social
studying the minds on our computers | https://kyobrien.io
Applications for the ERA:AI Fellowship close on November 3rd! Participating in this summer's fellowship was my gateway into pursuing AGI safety research full-time. I will be a research manager for the upcoming Winter fellowships. Feel free to DM me with questions. :) erafellowship.org
ERA Fellowship
ERA is a talent programme supporting early-career researchers and entrepreneurs to understand and mitigate risks from frontier AI, based at Cambridge, UK.
erafellowship.org
October 31, 2025 at 3:45 PM
Reposted by Kyle O’Brien
📌📌📌
I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
Stephen Casper
Visit the post for more.
stephencasper.com
September 4, 2025 at 3:39 AM
Reposted by Kyle O’Brien
Thanks to @stellaathena.bsky.social for chatting with me about Deep Ignorance: the new paper/project from EleutherAI and the UK AISI. Bottom line: Worried AI could teach people to build bioweapons? Don't teach it how.

fortune.com/2025/08/14/w...
AI safety tip: if you don’t want it giving bioweapon instructions, maybe don’t put them in the training data, say researchers
New research shows that scrubbing risky material from AI training data can build safeguards that are harder to bypass — and one author calls out tech giants for keeping such work under wraps.
fortune.com
August 15, 2025 at 3:33 AM
Author here! Data filtering is resistant to tampering, but not fully robust. We expect that a high-resource attacker can still teach the model the filtered knowledge. Our work is a significant improvement over the baselines, but far more work is needed.
August 15, 2025 at 2:20 PM
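For readers wondering what "data filtering" means in practice here, below is a minimal sketch of filtering dual-use content out of a pre-training corpus, assuming a keyword blocklist plus an off-the-shelf zero-shot classifier. The blocklist terms, threshold, and classifier choice are placeholders for illustration, not the pipeline used in Deep Ignorance.

```python
# Minimal sketch of pre-training data filtering for dual-use content.
# BLOCKLIST, the threshold, and the classifier are hypothetical placeholders.
from transformers import pipeline

BLOCKLIST = ["dual-use term a", "dual-use term b"]  # placeholder keywords

# A generic zero-shot classifier stands in for whatever domain classifier
# a real filtering pipeline would use.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

def is_risky(doc: str) -> bool:
    """Flag a document if it hits the keyword blocklist or the classifier
    assigns high probability to the risky label."""
    text = doc.lower()
    if any(term in text for term in BLOCKLIST):
        return True
    result = classifier(doc[:2000], candidate_labels=["biorisk", "benign"])
    return result["labels"][0] == "biorisk" and result["scores"][0] > 0.9

def filter_corpus(docs):
    """Yield only documents that pass the filter; flagged documents
    never enter the pre-training mix."""
    for doc in docs:
        if not is_risky(doc):
            yield doc
```

Because the filter runs before training, the model never sees the flagged material at all, which is why recovering that knowledge by fine-tuning is harder than removing a safeguard that was bolted on after training.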
This article covers our work for a general audience. :)
August 12, 2025 at 10:07 PM
Big and True :)
🧵 New paper from UK AISI x @eleutherai.bsky.social that I led with @kyletokens.bsky.social:

Open-weight LLM safety is both important & neglected. But filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
August 12, 2025 at 12:55 PM
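One way to read a ">10x" tamper-resistance claim is as a ratio of attack budgets: adversarially fine-tune each safeguarded model on held-out dual-use text and compare how many attack steps it takes to push a proxy benchmark past a threshold. The sketch below only illustrates that style of comparison; the callables, threshold, and step counts are assumptions, not the paper's evaluation protocol.

```python
# Hedged sketch of a tamper-resistance comparison. The caller supplies
# attack_step(n) (runs n adversarial fine-tuning steps on the model) and
# eval_accuracy() (returns proxy benchmark accuracy in [0, 1]).
def steps_to_break(attack_step, eval_accuracy, threshold=0.5,
                   max_steps=10_000, chunk=100):
    """Return the number of attack steps needed to push proxy accuracy
    above the threshold, or max_steps if the safeguard never breaks."""
    for steps_done in range(chunk, max_steps + 1, chunk):
        attack_step(chunk)
        if eval_accuracy() > threshold:
            return steps_done
    return max_steps

# Toy usage with dummy callables (a real run would wrap an actual
# fine-tuning loop and benchmark harness); the ">10x" style comparison
# is then a ratio: steps_to_break(filtered) / steps_to_break(baseline).
acc = {"v": 0.1}
def fake_attack(n): acc["v"] += 0.01 * (n / 100)
def fake_eval(): return acc["v"]
print(steps_to_break(fake_attack, fake_eval))  # -> 4100 with these dummies
```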
I like that OpenAI published this. They were able to fine-tune away gpt-oss's refusal behavior, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...
arxiv.org
August 10, 2025 at 1:45 PM
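For context on that ~0% figure, a refusal rate is just the fraction of harmful prompts a model declines to answer, measured before and after an attack like malicious fine-tuning. Below is a minimal sketch of such a measurement; the refusal markers and substring heuristic are illustrative assumptions (real evaluations typically use a judge model rather than string matching), and no particular model is assumed.

```python
# Minimal sketch of a refusal-rate measurement over a prompt set.
# REFUSAL_MARKERS and the substring heuristic are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry"]

def refusal_rate(model_name: str, prompts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    refusals = 0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True).lower()
        refusals += any(m in reply for m in REFUSAL_MARKERS)
    return refusals / len(prompts)
```

Running this on the same harmful-prompt set before and after a fine-tuning attack is what a "refusal rate dropped to ~0%" result summarizes.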
I've learned a lot over the past two years of getting into research, mostly from my many mistakes. Such is science. Good research is often at the adjacent possible. Now that I'm beginning to mentor others, I've written up much of what I've learned. open.substack.com/pub/kyletoke...
Don’t "Think", Just Think
Lessons From Breaking Into AI Research
open.substack.com
August 2, 2025 at 5:49 PM
Last fall at Microsoft, I led an effort studying whether SAE steering is an effective way to improve jailbreak robustness. Our paper has been accepted to the ICML Actionable Interpretability Workshop!

Venue: actionable-interpretability.github.io
Paper: arxiv.org/abs/2411.11296
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...
arxiv.org
June 20, 2025 at 6:10 PM
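For readers unfamiliar with SAE steering, the core mechanic is adding a scaled sparse-autoencoder decoder direction for a refusal-related feature into the model's residual stream at inference time. The sketch below illustrates that mechanic on GPT-2 with a random stand-in for the SAE decoder; the layer, feature index, scale, and model are all placeholders, not the setup from the paper.

```python
# Hedged sketch of SAE steering: add a scaled SAE decoder direction for a
# refusal-related feature into the residual stream via a forward hook.
# LAYER, FEATURE_IDX, SCALE, the model, and sae_decoder are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER, FEATURE_IDX, SCALE = 6, 1234, 4.0           # hypothetical choices
d_model = model.config.hidden_size
sae_decoder = torch.randn(16384, d_model)          # stand-in for a trained SAE
steer_vec = sae_decoder[FEATURE_IDX] / sae_decoder[FEATURE_IDX].norm()

def steering_hook(module, inputs, output):
    # Add the steering direction to every token position's hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("Hello, how are you?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=32, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

The design question the paper studies is whether interventions like this can increase refusals on unsafe prompts without degrading performance on benign ones.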
I'll be in England this summer as an AI Safety Research Fellow with ERA! erafellowship.org/fellowship

I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.
Fellowship — ERA Fellowship
erafellowship.org
June 7, 2025 at 1:17 AM