evolvingstuff
@evolvingstuff.bsky.social
I post about machine learning and occasionally some other stuff.
Reposted by evolvingstuff
MIT researchers showed that decoder-only transformers can't learn inverse permutation tasks, a result that challenges assumptions about their expressive capacity. New methods, like using "scratch tokens," could improve reasoning in large language models. https://arxiv.org/abs/2509.24125
The Impossibility of Inverse Permutation Learning in Transformer Models
September 30, 2025 at 9:31 AM
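For context, a rough sketch of the task the post refers to: given a permutation π of 0..n-1 presented as a sequence, the model must output π⁻¹, the sequence satisfying π⁻¹(π(i)) = i. The plain-integer-list framing below is an assumption for illustration; the paper's exact tokenization and prompt format may differ.

```python
# Illustrative only: the inverse-permutation task, framed as plain
# integer sequences (an assumption; the paper's format may differ).

def invert_permutation(perm: list[int]) -> list[int]:
    """Return the inverse of a permutation of 0..n-1.

    If perm[i] = j, the inverse satisfies inv[j] = i, so composing
    perm with inv (in either order) yields the identity.
    """
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return inv

perm = [2, 0, 3, 1]              # example input sequence
print(invert_permutation(perm))  # [1, 3, 0, 2]

# Sanity check: composing the two gives the identity.
inv = invert_permutation(perm)
assert [inv[p] for p in perm] == list(range(len(perm)))
```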
Reposted by evolvingstuff
GLM-4.6 is out and things aren’t looking good for Sonnet 4.5

- improved tool calling
- improved token utilization
- improved writing

docs.z.ai/guides/llm/g...
September 30, 2025 at 10:33 AM
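A minimal sketch of what tool calling against GLM-4.6 might look like, assuming an OpenAI-compatible chat-completions endpoint; the base URL, model id, and the `get_weather` tool are all assumptions made for illustration, so check the linked docs.z.ai guide before relying on them.

```python
# Sketch only: tool calling via an OpenAI-compatible client.
# The base_url, model name, and tool are assumptions; confirm via docs.z.ai.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",  # assumed endpoint
    api_key="YOUR_ZAI_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6",  # assumed model id
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```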