Lightnews — Scholar-powered news

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Opus 4.6 is here!

biggest wins on agentic search, HLE & ARC AGI 2

claude.com/blog/opus-4-...

A large comparison table showing benchmark performance across five model families, with columns labeled at the top: “Opus 4.6,” “Opus 4.5,” “Sonnet 4.5,” “Gemini 3 Pro,” and “GPT-5.2 (all models).” The Opus 4.6 column is visually highlighted with a light shaded background and rounded border.

Rows list tasks and benchmarks on the left, with percentages or scores across models:

“Agentic terminal coding (Terminal-Bench 2.0)”:
Opus 4.6: 65.4%
Opus 4.5: 59.8%
Sonnet 4.5: 51.0%
Gemini 3 Pro: 56.2% (54.2% self-reported)
GPT-5.2: 64.7% (64% self-reported, Codex CLI)

“Agentic coding (SWE-bench Verified)”:
Opus 4.6: 80.8%
Opus 4.5: 80.9%
Sonnet 4.5: 77.2%
Gemini 3 Pro: 76.2%
GPT-5.2: 80.0%

“Agentic computer use (OSWorld)”:
Opus 4.6: 72.7%
Opus 4.5: 66.3%
Sonnet 4.5: 61.4%
Gemini 3 Pro: —
GPT-5.2: —

“Agentic tool use (τ²-bench)”:
Retail: Opus 4.6 91.9%, Opus 4.5 88.9%, Sonnet 4.5 86.2%, Gemini 3 Pro 85.3%, GPT-5.2 82.0%
Telecom: Opus 4.6 99.3%, Opus 4.5 98.2%, Sonnet 4.5 98.0%, Gemini 3 Pro 98.0%, GPT-5.2 98.7%

“Scaled tool use (MCP Atlas)”:
Opus 4.6: 59.5%
Opus 4.5: 62.3%
Sonnet 4.5: 43.8%
Gemini 3 Pro: 54.1%
GPT-5.2: 60.6%

“Agentic search (BrowseComp)”:
Opus 4.6: 84.0%
Opus 4.5: 67.8%
Sonnet 4.5: 43.9%
Gemini 3 Pro: 59.2% (Deep Research)
GPT-5.2: 77.9% (Pro)

“Multidisciplinary reasoning (Humanity’s Last Exam)”:
Without tools: Opus 4.6 40.0%, Opus 4.5 30.8%, Sonnet 4.5 17.7%, Gemini 3 Pro 37.5%, GPT-5.2 36.6%
With tools: Opus 4.6 53.1%, Opus 4.5 43.4%, Sonnet 4.5 33.6%, Gemini 3 Pro 45.8%, GPT-5.2 50.0%

“Agentic financial analysis (Finance Agent)”:
Opus 4.6: 60.7%
Opus 4.5: 55.9%
Sonnet 4.5: 54.2%
Gemini 3 Pro: 44.1%
GPT-5.2: 56.6% (5.1)

“Office tasks (GDPVal-AA Elo)”:
Opus 4.6: 1606
Opus 4.5: 1416
Sonnet 4.5: 1277
Gemini 3 Pro: 1195
GPT-5.2: 1462

“Novel problem-solving (ARC AGI 2)”:
Opus 4.6: 68.8%
Opus 4.5: 37.6%
Sonnet 4.5: 13.6%
Gemini 3 Pro: 45.1% (Deep Thinking)
GPT-5.2: 54.2% (Pro)

“Graduate-level reasoning (GPQA Diamond)”:
Opus 4.6: 91.3%
Opus 4.5: 87.0%
S…

February 5, 2026 at 6:03 PM
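The post's claim about the "biggest wins" can be checked against the numbers in the table description above. The sketch below is only a quick arithmetic check, not part of the original post: it copies the Opus 4.6 and Opus 4.5 scores that are visible before the description is cut off, leaves out the GDPVal-AA row because it is an Elo rating rather than a percentage, and ranks the benchmarks by gain in percentage points. The names `scores` and `gains` are made up for this illustration.

```python
# Scores copied from the table description above, as (Opus 4.6, Opus 4.5) in percent.
# Only rows where both values are visible in the description are included.
scores = {
    "Terminal-Bench 2.0": (65.4, 59.8),
    "SWE-bench Verified": (80.8, 80.9),
    "OSWorld": (72.7, 66.3),
    "tau2-bench Retail": (91.9, 88.9),
    "tau2-bench Telecom": (99.3, 98.2),
    "MCP Atlas": (59.5, 62.3),
    "BrowseComp": (84.0, 67.8),
    "HLE (no tools)": (40.0, 30.8),
    "HLE (with tools)": (53.1, 43.4),
    "Finance Agent": (60.7, 55.9),
    "ARC AGI 2": (68.8, 37.6),
    "GPQA Diamond": (91.3, 87.0),
}


def gains(table):
    """Return (benchmark, gain in percentage points), largest gain first."""
    deltas = [(name, round(new - old, 1)) for name, (new, old) in table.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


ranked = gains(scores)
for name, delta in ranked:
    print(f"{name:22s} {delta:+.1f} pp")
```

On these figures the largest gains are ARC AGI 2, BrowseComp, and the two Humanity's Last Exam rows, which matches the post, while SWE-bench Verified and MCP Atlas come out slightly negative.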

Reposted by A.V.

Grace

@gracekind.net

Here’s one that’s not going to happen

February 4, 2026 at 8:40 PM

Reposted by A.V.

Chris Paxton

@cpaxton.bsky.social

New CATL sodium ion batteries have:
- better performance in cold temps
- cheaper to make than lithium ion batteries
- significantly more stable and safer from fires.

January 27, 2026 at 12:53 AM

A.V.

@slckl.bsky.social

A very sane AI usage policy for any open source project that still cares about quality.

Hacker News 100 @hn100.atproto.rocks · 15d

AI Usage Policy
https://github.com/ghostty-org/ghostty/blob/main/AI_POLICY.md

https://news.ycombinator.com/item?id=46730504

ghostty/AI_POLICY.md at main · ghostty-org/ghostty

👻 Ghostty is a fast, feature-rich, and cross-platform terminal emulator that uses platform-native UI and GPU acceleration. - ghostty-org/ghostty

github.com

January 23, 2026 at 6:10 PM

Reposted by A.V.

norvid_studies

@norvid-studies.bsky.social

Democracy basically means electing a president. But the president,

January 21, 2026 at 3:06 PM

A.V.

@slckl.bsky.social

A more efficient and more interpretable alternative to fat FFNs in Transformers. Sounds interesting...

Sung Kim @sungkim.bsky.social · 17d

Meta replaces FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense, enabling stable training, lower per-token compute, better interpretability, scalable parametric memory, and consistent accuracy gains without routing or communication overhead.

January 21, 2026 at 3:15 PM
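One way to read the quoted summary: in a gated FFN, the up-projection matmul is swapped for a per-layer table indexed by token id, while the gate and down-projection stay as dense matrices. The NumPy sketch below is an interpretation of that single sentence only and is not Meta's implementation. The class name `LookupFFN`, the SiLU gate, the shapes, and the initialization are all assumptions made for illustration.

```python
import numpy as np


def silu(x):
    """SiLU activation, x * sigmoid(x). The choice of activation is an assumption."""
    return x / (1.0 + np.exp(-x))


class LookupFFN:
    """Hypothetical gated FFN whose up-projection is a per-layer embedding lookup."""

    def __init__(self, vocab_size, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Dense gate and down-projection, as in a standard gated FFN.
        self.w_gate = rng.normal(0.0, 0.02, (d_model, d_ff))
        self.w_down = rng.normal(0.0, 0.02, (d_ff, d_model))
        # Layer-local table that stands in for the dense up-projection.
        self.up_table = rng.normal(0.0, 0.02, (vocab_size, d_ff))

    def __call__(self, hidden, token_ids):
        gate = silu(hidden @ self.w_gate)   # (seq, d_ff), dense matmul
        up = self.up_table[token_ids]       # (seq, d_ff), row lookup, no matmul
        return (gate * up) @ self.w_down    # (seq, d_model), dense matmul


ffn = LookupFFN(vocab_size=100, d_model=16, d_ff=64)
hidden = np.random.default_rng(1).normal(size=(5, 16))
token_ids = np.array([3, 14, 15, 92, 65])
out = ffn(hidden, token_ids)
print(out.shape)
```

In this reading, the lookup removes one of the three matmuls per token, and each table row can be inspected per token, which is at least consistent with the compute and interpretability claims in the summary.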

A.V.

@slckl.bsky.social

New king of 30B released?
A model size that remains largely feasible for local deployments.

Adina Yakup @adinayakup.bsky.social · 19d

Zhipu just released a powerful lightweight option of GLM 4.7

✨ 30B total/3B active - MoE
huggingface.co/zai-org/GLM-...

January 19, 2026 at 7:25 PM

Reposted by A.V.

Adina Yakup

@adinayakup.bsky.social

DeepSeek’s new work: Engram 🔥
Beyond MoE, it adds lookup style conditional memory to LLMs.

Paper: github.com/deepseek-ai/...

Can’t wait to see what’s coming next 👀

January 12, 2026 at 5:23 PM

Reposted by A.V.

Chris Paxton

@cpaxton.bsky.social

How many toddler sized robots do you think you could take in a fight

December 31, 2025 at 3:35 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Nvidia is buying Groq (not Grok) the fast AI inference provider

www.cnbc.com/2025/12/24/n...

Exclusive: Nvidia buying AI chip startup Groq's assets for about $20 billion in largest deal on record

Nvidia is making its largest purchase ever, acquiring assets from nine-year-old chip startup Groq for about $20 billion.

www.cnbc.com

December 24, 2025 at 10:11 PM

Reposted by A.V.

Chris Paxton

@cpaxton.bsky.social

Autonomous RIVR delivery robots in Pittsburgh

December 24, 2025 at 6:56 PM

A.V.

@slckl.bsky.social

FoundationStereo was a meaningful boost for getting nice 3d results for folks like me who barely know what a point cloud is. This new version looks almost as good, but promises to be way faster. Fingers crossed for a friendly license 🤞

Chris Paxton @cpaxton.bsky.social · Dec 17

FastFoundationStereo from nvidia. Exciting because 3d information remains one of the easiest ways to get reliability and generalization; if this becomes practical, it can accelerate robot deployment quite a lot over pure RGB-based methods. github.com/NVlabs/Fast-...

December 17, 2025 at 8:45 PM

Reposted by A.V.

Tim Kellogg

@timkellogg.me

Nemotron 3

A new hybrid mamba2/attention LLM from NVIDIA that beats Qwen3-30B-A3B (same size & shape)

Notes:
* 1M context, with incredible recall past 256K
* New open datasets
* 10 open source RL environments

Overall this is a huge win for neolabs

huggingface.co/nvidia/NVIDI...

A wide bar chart comparing **accuracy** (left axis) and **relative throughput** (right axis) across multiple benchmarks for three models.

**Legend / Models**

* **Green:** Nemotron-3-Nano-30B-A3B
* **Blue:** Qwen3-30B-A3B-Thinking-2507
* **Gray:** GPT-OSS-20B-A4B

**Left Y-axis:** Accuracy (%)
**Right Y-axis:** Relative Throughput (Output tokens/s/GPU)
A dashed vertical line separates accuracy benchmarks (left) from throughput (right).

---

### Accuracy benchmarks (left to right)

* **Arena-Hard-v2-Avg (Chat):**
Nemotron **67.7**, Qwen **57.8**, GPT-OSS **48.5**

* **AIME25 (Math):**
Nemotron **99.2** (+tools noted), Qwen **85.0**, GPT-OSS **98.7**
(lighter labels near bars show intermediate values ~89.1 and ~91.7)

* **IFBench (Inst. Following):**
Nemotron **71.5**, Qwen **51.0**, GPT-OSS **65.0**

* **τ²-Bench (Tool Use):**
Nemotron **49.0**, Qwen **47.7**, GPT-OSS **47.5**

* **SWE-Bench (Coding):**
Nemotron **38.8**, Qwen **22.0**, GPT-OSS **34.0**

* **LCB v6 (Coding):**
Nemotron **68.2**, Qwen **66.0**, GPT-OSS **61.0**

* **RULER @ 1M (Long Ctx):**
Nemotron **86.3**, Qwen **77.5**, GPT-OSS **N/A**

---

### Throughput (right of dashed line)

* **ISL/OSL 8k/16k:**
Nemotron **3.3**, Qwen **1.0**, GPT-OSS **1.5**

---

**Caption (bottom):**
*Figure 2 | The hybrid Mamba-Transformer MoE architecture used by Nemotron 3 models can achieve state-of-the-art accuracy on leading reasoning benchmarks and ultra-long-context tasks while providing throughput improvements over similarly sized Transformer MoEs. For details, please see the Nemotron Nano 3 technical report.*

December 16, 2025 at 1:15 PM

Reposted by A.V.

Grace

@gracekind.net

hittingawall.jpg

December 15, 2025 at 11:37 PM

Reposted by A.V.

Ai2

@ai2.bsky.social

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

December 15, 2025 at 5:19 PM

Reposted by A.V.

Tim Duffy

@timfduffy.com

Here's a great review of what we saw in AI this year, from @gleech.org

AI in 2025: gestalt — LessWrong

This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.) …

www.lesswrong.com

December 8, 2025 at 5:24 PM

Reposted by A.V.

Simon Willison

@simonwillison.net

Four new models from Mistral today - all Apache 2 licensed, all vision-capable, and one of them is a 3GB model that can run in a web browser and answer questions about things it can see through the webcam! simonwillison.net/2025/Dec/2/i...

Introducing Mistral 3

Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active. …

simonwillison.net

December 2, 2025 at 5:32 PM

A.V.

@slckl.bsky.social

In other European news, Kyutai Labs, a non-profit AI research lab, spawned their (first?) for-profit branch: gradium.ai

With a $70M seed round, they look serious.

In their own words:
gradium.ai/blog/gradium
On the bad site, they even got a little promo video: x.com/GradiumAI/st...

Gradium

Text-to-Speech, Speech-to-Text, and Speech-to-Speech AI models

gradium.ai

December 2, 2025 at 7:28 PM

A.V.

@slckl.bsky.social

Mistral dropped Ministral 3B, 8B, and 14B models and the big one - a seemingly DeepSeek-shaped Mistral Large 3, a 675B MoE brick. All Apache 2!

Happy to see some European action in the usable model space.

Mistral blog post: mistral.ai/news/mistral-3

Mistral 3 benchmarks, showing it to be competitive with DeepSeek 3.2 and Kimi-K2 on MMLU, GPQA-Diamond, SimpleQA, AMC and LiveCodeBench.

December 2, 2025 at 7:19 PM

Reposted by A.V.

Grace

@gracekind.net

I’m running on a platform of Everyone Needs To Talk To Opus 4.5 For Two Hours

November 30, 2025 at 8:39 PM

Reposted by A.V.

Simon Willison

@simonwillison.net

At the risk of starting the flame war to end all flame wars...

Modern LLMs (GPT-5.1, Claude 4.5, Gemini 3) produce excellent code and can be a significant productivity boost to software engineers who take the time to learn how to effectively apply them - especially if used with coding agent tools

November 27, 2025 at 7:55 PM

Reposted by A.V.

Alexander Doria

@dorialexander.bsky.social

And a major open science release from Prime Intellect: they don’t stress it enough, but the SFT part goes beyond post-training. This is a fully documented mid-training with tons of insights/gems on MoE training, asynchronous infra RL, deep research. storage.googleapis.com/intellect-3-...

November 27, 2025 at 7:47 AM

A.V.

@slckl.bsky.social

SAM3 dropped for those who celebrate!
Similar to SAM2, it can segment stuff based on points and track stuff, but now it can directly segment stuff based on text and image prompts, too.

Webpage: ai.meta.com/sam3/
Repo: github.com/facebookrese...