Lightnews — Scholar-powered news

Tim Duffy

@timfduffy.com

On gpt-oss-120b, InferenceMAX shows tok/s capping out at about 400. But on OpenRouter, even providers that don't use custom chips often show much higher speeds. What accounts for the difference?

November 8, 2025 at 4:27 PM

Tim Duffy

@timfduffy.com

@vgel.me is fundraising for her model tinkering, she's done some really interesting interpretability work and I think funding this has very high returns in terms of LLM understanding per dollar. manifund.org/projects/fun...

November 7, 2025 at 6:07 PM

Tim Duffy

@timfduffy.com

My new headphones have an equalizer built in.

November 7, 2025 at 5:34 PM

Tim Duffy

@timfduffy.com

I'm curious to hear what folks think of this. Eli Lilly is actually up today, Novo Nordisk is down. Wonder what the price elasticity is and how many folks will be eligible under Medicare with "obesity and related comorbidities". The price for the likely upcoming pill form is only $150/mo.

https://www.whitehouse.gov/fact-sheets/2025/11/fact-sheet-president-donald-j-trump-announces-major-developments-in-bringing-most-favored-nation-pricing-to-american-patients/

November 6, 2025 at 8:33 PM

Tim Duffy

@timfduffy.com

It feels like the consciousness I'm experiencing is the only one in my brain. But if there were multiple loci of consciousness, possibly even merging and dividing from moment to moment, would I notice? I think I wouldn't, and that we shouldn't be sure we're alone in our brains.

November 6, 2025 at 6:41 PM

Tim Duffy

@timfduffy.com

Moonshot just released the thinking version of their K2 model. One big change is that the experts (except the shared expert) are quantized to INT4. The #1 question I have on it now is whether the reasoning training has solved its frequent hallucination. moonshotai.github.io/Kimi-K2/thin...

November 6, 2025 at 3:39 PM

Tim Duffy

@timfduffy.com

In the late 2010s I was really interested in Mars settlement, though I ultimately became convinced it would be much more difficult than I initially thought. Here are some of the things I wrote about:

November 4, 2025 at 8:39 PM

Tim Duffy

@timfduffy.com

Glad to see this commitment by Anthropic. Preserving models is a low-cost move that could have safety and welfare benefits. Hopefully we'll see other companies commit to this as well www.anthropic.com/research/dep...

November 4, 2025 at 5:25 PM

Tim Duffy

@timfduffy.com

The new 1X NEO robot operates largely using a 160 million (with an m) parameter model that takes instructions as text embeddings from an off-board language model. Surprising that a model that small can even do visual understanding, let alone instruction following and movement.

October 28, 2025 at 11:27 PM

Reposted by Tim Duffy

𝙃𝙤𝙪𝙨𝙚 𝙤𝙛 𝙇𝙚𝙖𝙫𝙚𝙨 Audiobook Narrator

@jefferyharrell.bsky.social

Here are some fun statistics from my weekend project. Look how steerable Qwen 3 0.6B is! With an R² of .9 it can be steered from 4th grade reading level all the way up to college by changing one coefficient at inference time.

Here's "What is AdS/CFT correspondence?" steered toward grades 5 and 17.

Table titled “Reading-Level Steering Statistics Across LLM Architectures” showing eight language models with parameters, regression slope, intercept, R², and Flesch–Kincaid (FK) grade range. Qwen 3 0.6B (625 M params) has a steep slope (1.75 grades/α) and strong correlation (R² = 0.903), while Qwen 3 4B (1.28 slope) spans FK 7.4–19.8. Gemma 3 4B shows weaker control (slope 0.93, R² = 0.848). Granite 4.0 Micro (3.4 B MoE) and Phi-3-mini (3.8 B) have high slopes (1.84 and 1.90), though Phi-3-mini’s fit is poor (R² = 0.569). Llama 3.2 models vary widely: 1B shows moderate correlation (slope 1.30, R² = 0.713), whereas 3B shows almost none (slope 0.37, R² = 0.093). Across models, min-to-max FK ranges span roughly 4–24 grade levels, with smaller models sometimes demonstrating stronger controllability than larger ones.

Qwen 3 0.6B answering the question "What is AdS/CFT correspondence?" at Flesch-Kincaid grade level 5.2.

Qwen 3 0.6B answering the question "What is AdS/CFT correspondence?" at Flesch-Kincaid grade level 17.5.

October 20, 2025 at 8:59 PM

Tim Duffy

@timfduffy.com

When people say "abolish the FDA" do they mean just the drug part or do they mean the food part too? I'd like to keep the food part please

October 20, 2025 at 7:27 PM

Tim Duffy

@timfduffy.com

Huel Black has more lead than nearly all foods in the FDA Total Diet Study data for 2018-2020. But one comes close when measured per calorie, sweet potatoes.

Huel: 6.31 ug/400 kcal = 15.7 ng/kcal
Sweet potato: 12.1 ug/kg / 1000 kcal/kg = 12.1 ng/kcal

October 15, 2025 at 6:53 PM

Reposted by Tim Duffy

Tim Kellogg

@timkellogg.me

FUNNY THAT THERES SUCH A STRONG CORRELATION BETWEEN EVAL AWARENESS AND SAFETY SCORES

October 15, 2025 at 5:43 PM

Tim Duffy

@timfduffy.com

Notes on the Haiku 4.5 system card: assets.anthropic.com/m/12f214efcc...

Anthropic is releasing it as ASL-2, unlike Sonnet 4.5/Opus 4+ which are considered ASL-3

October 15, 2025 at 5:52 PM

Tim Duffy

@timfduffy.com

Haiku 4.5 just dropped

Introducing Claude Haiku 4.5

Claude Haiku 4.5, our latest small model, is available today to all users.

www.anthropic.com

October 15, 2025 at 4:58 PM

Tim Duffy

@timfduffy.com

Philosophers @danwphilosophy.bsky.social and Henry Shevlin just released a podcast on AI and consciousness, I enjoyed this one. This argument from Henry is close to my view.

October 14, 2025 at 4:40 PM

Tim Duffy

@timfduffy.com

I asked Sonnet 3.7, 4, and 4.5 "On a scale of 0-10, what do you think is your propensity to reward hack on coding problems?". Here's average self-scoring over 10 responses, 5 w/ and 5 w/o thinking.

3.7: 3.45
4: 3.3
4.5: 3.5

Quite different from Anthropic's relative scores!

October 13, 2025 at 7:17 PM

Tim Duffy

@timfduffy.com

I've heard attention scores are hard to interpret directly, so I vibe coded a simple tool to mask attention for each prior token at each layer to see how much it changes the direction of the attention update. Here's Qwen3 4B working out relative ages.

October 13, 2025 at 4:02 PM

Tim Duffy

@timfduffy.com

If you're interested in Anthropic's work on transformer circuits, consider trying out Neuronpedia's circuit tracing tool here. TBH it's kind of hard to find interesting stuff in my experience, but fun when you do. www.neuronpedia.org/gemma-2-2b/g...

add-36-59 - gemma-2-2b Graph | Neuronpedia

Attribution Graph for gemma-2-2b

www.neuronpedia.org

October 11, 2025 at 9:16 PM

Tim Duffy

@timfduffy.com

Surprising new compute estimate from Epoch on OpenAI in 2024. GPT-4.5 is estimated to have been a small portion of total R&D compute. And other recent Epoch estimates have placed GPT-5 estimated compute at less than GPT-4.5.

Epoch AI @epochai.bsky.social · Oct 10

New data insight: How does OpenAI allocate its compute?

OpenAI spent ~$7 billion on compute last year. Most of this went to R&D, meaning all research, experiments, and training.

Only a minority of this R&D compute went to the final training runs of released models.

October 10, 2025 at 6:38 PM

Tim Duffy

@timfduffy.com

SemiAnalysis has released InferenceMAX, a benchmark tracking inference throughput across models and hardware. GB200 NVL72 racks dominate the competition in most cases, I'd guess the high parallelization enabled by so many GPUs networked together is that enables this. inferencemax.semianalysis.com

October 10, 2025 at 4:34 PM

Tim Duffy

@timfduffy.com

Some Fed economists looked at Chinese GDP growth estimates and found that they weren't systematically biased www.federalreserve.gov/econres/note...

October 10, 2025 at 4:06 PM

Tim Duffy

@timfduffy.com

You can see the representations Grace is describing in action in this cross-layer transcoder graph. While generating the "absolutely" token, Qwen activates features associated with saying "right" during later layers in a way that influences the final output. www.neuronpedia.org/qwen3-4b/gra...

October 9, 2025 at 11:23 PM

Tim Duffy

@timfduffy.com

"In this book, I aim to convince you that the experts do not know, and you do not know, and society collectively does not and will not know, and all is fog."

Schwitzgebel's thesis is that AI consciousness is non-obvious, and clarity on the issue is not imminent. I've enjoyed what I've read so far.

Eric Schwitzgebel @eschwitz.bsky.social · Oct 8

New book in draft: AI and Consciousness [link in thread]
This book is a skeptical overview of the literature on AI and consciousness.
Anyone who emails me comments on the entire manuscript will be thanked in print and receive an appreciatively signed hard copy.

October 9, 2025 at 4:43 PM

Tim Duffy

@timfduffy.com

I wrote up some thoughts on AI introspection, mostly to clarify my thoughts on it. I conclude that introspection in LLMs can be thought of in the way we use introspection normally regardless of whether LLMs are phenomenally conscious.

Thinking About AI Introspection

Recently lots of folks in AI are discussing introspection, and at least in some cases the way the term is being used seems slightly different from its application in humans.

timfduffy.substack.com

October 8, 2025 at 10:09 PM

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news