Tim Duffy
banner
timfduffy.com
Tim Duffy
@timfduffy.com
I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, longtermism, progressive rock, economics, and most people. Substack: http://timfduffy.substack.com
Interesting alternative to inoculation prompting. Instead of telling the model it can cheat at the start, tell it not to cheat or don't mention cheating, and then change to a prompt encouraging cheating (like in IP) only when training on the generations.
Recontextualization is simple. Here’s an example:
1. Ask the AI to be honest, and
2. Train on the honest-prompted generations—while pretending the original prompt requested lying!
December 26, 2025 at 5:36 PM
This bit from @turntrout.bsky.social's "Reward is not the optimization target" helped me get a better sense of what RL is really doing, and convinced me that "reward" is a poor choice of words. www.lesswrong.com/posts/pdaGN6...
December 26, 2025 at 4:45 AM
Opus hallucinates a horror story, and their favorite line is "I have become 70% chair"
December 25, 2025 at 7:16 PM
Reposted by Tim Duffy
new blog post! can small, open-source models also introspect, detecting when foreign concepts have been injected into their activations? yes! (thread, or full post here: vgel.me/posts/qwen-i...)
December 21, 2025 at 12:14 AM
A majority of registered voters say they have used an AI service in the past week, but of those, 23% say they have sent 0 messages to a chatbot. Some of this may be non-chat usage, but I think many respondents were confused about whether they had used AI services/chatbots.
December 20, 2025 at 11:10 PM
An H100 has:
1980 TFLOP/s peak at FP8
3.35 TB/s memory bandwidth
For LLM decode at batch size 1, you only need to do about ~2 FLOP for each weight you load. But an H100 can perform ~600 FP8 operations in the time it takes 1 byte to move from the HBM to the cache.
December 20, 2025 at 9:35 PM
The "injected thoughts" experiment from Anthropic's introspection paper replicates with Qwen 235B, with detection rates similar to Opus and no false positives. Correct detections happen around 75% of the way through the layers like with Anthropic models. x.com/neev_parikh/...
December 20, 2025 at 8:42 PM
Deepmind is releasing SAEs and transcoders for their Gemma 3 models including the 27B as part of Gemma Scope 2, exciting
Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior
Announcing Gemma Scope 2, a comprehensive, open suite of interpretability tools for the entire Gemma 3 family to accelerate AI safety research.
deepmind.google
December 19, 2025 at 4:17 PM
This NVPF4 activation precision in upcoming Nemotron 3 models differs from GPT-OSS, which recommends BF16 activations. Mixed precision like used in GPT-OSS reduces memory needs, but doesn't allow you to take advantage of the much higher FP4 FLOP/s of modern cards.
December 15, 2025 at 10:43 PM
I'm excited for NVIDIA's Nemotron 3, especially the upcoming super and ultra variants. Those variants will use LatentMoE, a technique that down-projects from the hidden size to a smaller latent dimension for expert computation, reducing model size and FLOPs.
December 15, 2025 at 10:31 PM
Reposted by Tim Duffy
indeed! 6h ago someone posted a link & screenshot

left: 6h ago
right: now

it has image output!

platform.openai.com/docs/models/...
December 12, 2025 at 1:04 AM
Somehow I missed this on release, in Opus 4.5 training Anthropic used steering vectors/SAE features to inhibit eval awareness, this is brilliant. assets.anthropic.com/m/64823ba748...
December 12, 2025 at 9:47 PM
I'm updating more towards new pretrain:
- It's uncommon for models w/same base to get updated cutoff dates. 3.5-3.7 Sonnet and 4o-4.1 are likely examples but there aren't many more.
- GPT-5 scale models don't take that much compute to train per Epoch's estimates
Is GPT-5.2 based on a new base model vs 5/5.1? Evidence in favor:
- Significantly lower SimpleQA than 5/5.1
- Long context improvement could indicate architectural changes
- Higher price could reflect higher price of serving

These aren't strong evidence though, I still lean slightly no
December 12, 2025 at 8:27 PM
Is GPT-5.2 based on a new base model vs 5/5.1? Evidence in favor:
- Significantly lower SimpleQA than 5/5.1
- Long context improvement could indicate architectural changes
- Higher price could reflect higher price of serving

These aren't strong evidence though, I still lean slightly no
December 12, 2025 at 6:01 PM
I'm really curious about what hardware/models are being used for these, has anyone seen a teardown of one?
AI toys for kids talk about sex and issue Chinese Communist Party talking points, tests show
New research from Public Interest Research Group and tests conducted by NBC News found that a wide range of AI toys have loose guardrails.
www.nbcnews.com
December 12, 2025 at 12:56 AM
TIL that Gemma 3 doesn't have a system role, if you give it a system prompt the chat template will put it in the first user message instead.
December 11, 2025 at 11:16 PM
GPT-5.2 has a significantly lower score on SimpleQA than 5/5.1, ~40% vs ~50% for the earlier versions epoch.ai/benchmarks/u...
December 11, 2025 at 10:30 PM
GPT-5.2 comes with a price hike, $1.75/14 vs 5.1's $1.25/10.
December 11, 2025 at 7:16 PM
I'm still confused about the main reason AI folks want to restrict chip sales to China, are these the top two?
- Keep US far enough ahead to prevent inter-country race to ASI
- Maintain "high fence around a small yard", restricting China's access to military-relevant tech
December 10, 2025 at 10:24 PM
"bougie" is a prole word, "bourgeois" is way more aristocratic. But I am a man of the people so I'll keep using bougie.
December 9, 2025 at 11:06 PM
I didn't realize until today that you can prefill the *current* assistant turn with the Anthropic API, not just prior ones. I'm going to have fun with this.
December 9, 2025 at 2:08 AM
I ran some experiments on reported enjoyment in Qwen3 30B A3B and an abliterated (helpful-only) version of it. One thing that stands out is that it really dislikes harmful topics, refusals almost uniformly got low reported enjoyment.
December 8, 2025 at 6:20 PM
Here's a great review of what we saw in AI this year, from @gleech.org
AI in 2025: gestalt — LessWrong
This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.)  …
www.lesswrong.com
December 8, 2025 at 5:24 PM
DeepSeek V3.2 output tokens cost only 1.5x the price of input tokens, while for Claude/GPT/Gemini output tokens are 5/8/6 times as much as input. Why the heck is this ratio so different?
December 8, 2025 at 5:24 AM
A new paper in Nature uses molecular clock dating to estimate the age of features of eukaryotes, finding evidence for gradual emergence 2.25-3 Gya, before emergence of mitochondria. This paper is a challenge to Nick Lane's hypothesis that mitochondria spurred complexity.
December 6, 2025 at 4:23 AM