Tim Kellogg
@timkellogg.me
AI Architect | North Carolina | AI/ML, IoT, science
WARNING: I talk about kids sometimes
is this where Google overtakes OpenAI?
November 11, 2025 at 10:46 PM
New York governor’s open letter to AI companies operating in the state
November 11, 2025 at 9:12 PM
interesting paper
imo if base models perform the same at high pass@k, then RLVR is just making them better *agents*, bc the reduced error rate translates into longer agent trajectories
so while there are limits to RLVR, it’s clearly necessary
limit-of-rlvr.github.io
November 11, 2025 at 2:12 PM
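(to make the pass@k point concrete, a toy sketch with made-up numbers, using the standard unbiased pass@k estimator from the HumanEval paper)

```python
# toy sketch of the pass@k intuition, numbers made up.
# pass@k = chance at least one of k samples is correct; the unbiased
# estimator from the HumanEval paper is 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn, c = correct samples, k = sampling budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# base model: 5/100 samples correct; RLVR model: 40/100 (made-up numbers)
for label, c in [("base", 5), ("RLVR", 40)]:
    p1, p100 = pass_at_k(100, c, 1), pass_at_k(100, c, 100)
    print(f"{label}: pass@1={p1:.2f}, pass@100={p100:.2f}")

# both reach pass@100 = 1.00, but pass@1 is what compounds over a long
# agent trajectory: 0.40**10 ≈ 1e-4 vs 0.05**10 ≈ 1e-13
print(f"10-step trajectory: base {0.05**10:.1e}, RLVR {0.40**10:.1e}")
```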
oh! teor likes it! congrats “Alexander” @dorialexander.bsky.social 🤣
November 11, 2025 at 12:17 PM
chat, what are we thinking: quantization or batch size?
November 11, 2025 at 12:15 PM
while being the most French model yet, they had to rationalize why it wasn’t trained on French
but fr imagine being able to do ablations on THE ENTIRE end-to-end training process. you’d learn so much
November 11, 2025 at 2:36 AM
sent this to my brother asking, “does this count as wealth redistribution?”
(fun fact: my bro voted for Trump and is also undergoing collapse of the company he’s CEO of due to tariffs)
November 9, 2025 at 9:36 PM
The town of German, NY elected 2 positions on write-in ballots alone
1. Superintendent of Highways
2. Town Justice
apparently no one ran
November 9, 2025 at 9:06 PM
Polaris Alpha, believed to be GPT-5.1 non-reasoning, scores just below Sonnet 4.5 on HLE (unofficial run)
There will be a reasoning version too, and OpenAI excels at RL & post-training, so I have high expectations for it
also leaked: Nov 24 release date
November 9, 2025 at 2:18 PM
idk is a 50 year mortgage even worth it?
November 8, 2025 at 10:45 PM
GPT-5-codex-mini
Almost the same performance as GPT-5-codex on high, but 4x faster and without pesky things like a warm personality
www.neowin.net/amp/openai-i...
November 8, 2025 at 4:46 PM
“nah, we don’t do 996”
November 8, 2025 at 12:49 PM
this morning, X is saturated with people from the US claiming that their favorite unknown benchmark (that happens to show K2 trailing US models) is actually the best single benchmark to watch
lol notice how they clipped off the top 12
November 8, 2025 at 12:10 PM
K2-Thinking is available in the Kimi app now
November 7, 2025 at 7:29 PM
GPT-5.1 is live on OpenRouter via stealth preview
November 7, 2025 at 4:15 PM
i haven’t figured out how to use it, but apparently Kimi K2-Thinking has a Heavy mode with 8 parallel trajectories that are reflectively aggregated
it does better than GPT-5-pro on HLE
November 7, 2025 at 4:04 PM
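(for the curious: a minimal sketch of what “parallel trajectories, reflectively aggregated” plausibly means. the `chat()` function, prompts, and structure here are my guesses, not Kimi’s actual API or implementation)

```python
# hypothetical sketch of a Heavy-style mode: sample k trajectories in
# parallel, then have the model reflect over all candidates and merge
# them into one answer. this is a guess at the shape, not Kimi's code.
from concurrent.futures import ThreadPoolExecutor

def chat(prompt: str, temperature: float = 1.0) -> str:
    """Stand-in for a model call; wire up your own client here."""
    raise NotImplementedError

def heavy(question: str, k: int = 8) -> str:
    # 1) k independent high-temperature rollouts
    with ThreadPoolExecutor(max_workers=k) as pool:
        drafts = list(pool.map(lambda _: chat(question), range(k)))
    # 2) one reflective aggregation pass over all candidates
    candidates = "\n\n".join(
        f"Candidate {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return chat(
        f"Question: {question}\n\n{candidates}\n\n"
        "Compare the candidates, note where they disagree, and write "
        "the single best final answer.",
        temperature=0.0,
    )
```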
K2-Thinking is SOTA, top model in agentic tool calling
November 7, 2025 at 10:40 AM
this really highlights how LLMs do math
math is a string of many operations, so one small error (e.g. a misremembered shortcut) causes cascading calculation errors downstream
November 7, 2025 at 1:02 AM
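(toy arithmetic on the compounding, illustrative numbers only: per-op accuracy multiplies across the chain)

```python
# toy illustration (numbers made up): per-operation accuracy compounds
# multiplicatively, so long chains amplify even small error rates
for per_op in (0.999, 0.99, 0.95):
    for n_ops in (10, 50, 200):
        print(f"acc/op={per_op}: {n_ops} ops -> "
              f"{per_op ** n_ops:.2f} chance of a clean chain")
# 0.99/op looks great in isolation but drops to ~0.61 over 50 ops
# and ~0.13 over 200, which is why one misremembered shortcut is fatal
```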
Surprising: Math requires a lot of memorization
Goodfire is at it again!
They developed a method similar to PCA that measures how much of an LLM’s weights are dedicated to memorization
www.goodfire.ai/research/und...
November 7, 2025 at 1:02 AM
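(NOT Goodfire’s actual method, just the PCA-flavored intuition in a few lines: decompose a weight matrix and see how concentrated its spectrum is)

```python
# crude PCA-flavored intuition, not Goodfire's method: SVD a weight
# matrix and ask how many directions carry most of its energy. broadly
# shared structure concentrates in a few big components; idiosyncratic
# (memorized) detail smears across the long tail.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)) / 512**0.5  # stand-in "weight matrix"
s = np.linalg.svd(W, compute_uv=False)          # singular values
energy = np.cumsum(s**2) / np.sum(s**2)
k90 = int(np.searchsorted(energy, 0.90)) + 1
print(f"components for 90% of spectral energy: {k90}/512")
# for a real checkpoint you'd compare layers/models: fewer components
# for the same energy suggests more capacity on shared structure
```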
notable: they ripped out the silicon that supports training
they say: “it’s the age of inference”
which, yeah, RL is mostly inference. Continual learning is almost all inference. Ambient agents and fast-growing inference demand from general audiences, too
kartik343.wixstudio.com/blogorithm/p...
November 7, 2025 at 12:43 AM
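(napkin math on “RL is mostly inference”, every number below assumed: generated tokens cost roughly 2 FLOPs/param, trained tokens roughly 6, but gradients typically touch only a fraction of the rollout tokens)

```python
# napkin math, all numbers assumed: in an RLVR step you generate long
# rollouts (inference) but backprop on only the kept fraction of them
params = 100e9                          # assumed model size
rollout_tokens = 64 * 8_000             # 64 rollouts x 8k tokens (assumed)
trained_tokens = rollout_tokens * 0.25  # assume 25% of rollouts kept

inference_flops = 2 * params * rollout_tokens  # ~2 FLOPs/param/token
training_flops = 6 * params * trained_tokens   # ~6 FLOPs/param/token
share = inference_flops / (inference_flops + training_flops)
print(f"inference share of an RL step: {share:.0%}")  # ~57% here
# and that's before counting eval/reward calls; longer rollouts or
# harsher filtering push the inference share higher still
```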