Sebastian Raschka (rasbt)
@sebastianraschka.com
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & a follow-up book on reasoning models (https://mng.bz/Nwr7).

Also blogging about AI research at magazine.sebastianraschka.com.
Uploaded my State of LLMs 2025 report for this year:
magazine.sebastianraschka.com/p/state-of-l...

I planned to just write a brief overview, but yeah, it was an eventful year, so it was impossible to keep it below 7,000 words :D.
The State Of LLMs 2025: Progress, Progress, and Predictions
A 2025 review of large language models, from DeepSeek R1 and RLVR to inference-time scaling, benchmarks, architectures, and predictions for 2026.
magazine.sebastianraschka.com
December 30, 2025 at 4:22 PM
One of the underrated papers this year:
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)

(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
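To make the contrast concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper); `model` and `loader` are hypothetical placeholders for any model and micro-batch data loader:

```python
import torch
import torch.nn.functional as F

def train_with_accumulation(model, loader, accum_steps=8, lr=1e-4):
    # Emulates a large batch: gradients from `accum_steps` micro-batches
    # are accumulated before a single optimizer update.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = F.cross_entropy(model(x), y)
        (loss / accum_steps).backward()  # scale so the summed grads match the large-batch average
        if (step + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()

def train_small_batch(model, loader, lr=1e-4):
    # The paper's argument (as I read it): just update after every micro-batch.
    # Same tokens seen overall, more frequent updates, no extra memory spent
    # holding accumulated gradients.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
```

Both loops consume the same data; the only difference is how often the optimizer steps.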
December 29, 2025 at 3:52 PM
I think of it as this: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive.
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.
December 28, 2025 at 4:03 PM
The LLM eras:

202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?
December 22, 2025 at 3:40 PM
Just updated the Big LLM Architecture Comparison article...
...it grew quite a bit since the initial version in July 2025, more than doubled!
magazine.sebastianraschka.com/p/the-big-ll...
December 13, 2025 at 2:22 PM
Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA?

Just went through the config files; the only difference I could see is that Mistral 3 Large uses 2x fewer experts but makes each expert 2x larger.
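A quick back-of-the-envelope check with illustrative numbers (not the actual config values) shows why that roughly cancels out in total MoE parameter count:

```python
def total_moe_params(n_experts, d_model, d_expert):
    # 2 weight matrices (up- and down-projection) per expert; SwiGLU experts
    # have 3, but the proportionality argument is the same
    return n_experts * 2 * d_model * d_expert

a = total_moe_params(n_experts=256, d_model=7168, d_expert=2048)  # DeepSeek-V3-style ballpark
b = total_moe_params(n_experts=128, d_model=7168, d_expert=4096)  # half the experts, each 2x larger
print(a == b)  # True: halving the count while doubling the size cancels out
```

The per-token (active) parameter count then depends mainly on how many experts the router selects per token.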
December 12, 2025 at 7:14 PM
Excited for my first conference in Europe in April. I’ll be talking about LLMs, Python, coding, and all the fun stuff, and I’m looking forward to meeting fellow AI builders there!
We’re thrilled to welcome Sebastian Raschka to PyCon DE & PyData 2026.

From Scratch to Scale: How Far Python Takes You in Building LLMs
A deep dive into how Python powers experimentation, training, and scaling of LLMs.

📍 Darmstadt, April 14–16, 2026
December 5, 2025 at 4:21 AM
This interesting week started with DeepSeek V3.2!

I just wrote up a technical tour of the predecessors and components that led up to this:

🔗 magazine.sebastianraschka.com/p/technical-...

- Multi-Head Latent Attention
- RLVR
- Sparse Attention
- Self-Verification
- GRPO Updates
A Technical Tour of the DeepSeek Models from V3 to V3.2
Understanding How DeepSeek's Flagship Open-Weight Models Evolved
magazine.sebastianraschka.com
December 3, 2025 at 2:51 PM
Looks like we got a new DeepSeek model over the holidays (again): github.com/deepseek-ai/...

Basically pushes RLVR & self-refinement to gold-level scores on IMO 2025.

Coincidentally, I am currently working on a chapter on self-refinement, and this comes in handy as a nice, scaled-up case study.
November 29, 2025 at 3:11 PM
Lots of interesting LLM releases last week. My fav was actually Olmo 3 (I love the Olmo series for being fully open source and transparent).
If you are interested in reading through the architecture details, I coded it from scratch here: github.com/rasbt/LLMs-f...
November 23, 2025 at 2:31 PM
Inference-scaling lets us trade extra compute for better modeling accuracy. Next to RL, it has become one of the most important concepts in today's LLMs, so the book will cover it in two chapters instead of just one.

If you are looking for something to read this weekend, Chapter 4 is available now: mng.bz/Dwra
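As one concrete illustration of the idea (my own toy example, not an excerpt from the chapter), here is a self-consistency-style majority-voting sketch, where `generate` is a hypothetical stand-in for any sampling-based LLM call:

```python
from collections import Counter

def answer_with_voting(prompt, generate, n_samples=8, temperature=0.8):
    # More samples = more inference compute = (often) a more reliable answer.
    answers = [generate(prompt, temperature=temperature) for _ in range(n_samples)]
    # Return the most frequent final answer across the sampled completions.
    return Counter(answers).most_common(1)[0][0]
```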
November 20, 2025 at 2:44 PM
What should we focus on, (more) LLM training or inference scaling? (A question I got asked multiple times now, so here are some thoughts.)

Training is usually very, very expensive, but it is a one-time cost. Inference-scaling is comparatively cheap, but it's a cost we pay with every query.
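A toy break-even calculation (all numbers made up) makes the trade-off concrete:

```python
extra_training_cost = 2_000_000   # hypothetical one-time cost (USD) of additional training
extra_cost_per_query = 0.002      # hypothetical extra cost (USD) per query from inference-time scaling

break_even_queries = extra_training_cost / extra_cost_per_query
print(f"{break_even_queries:,.0f} queries")  # 1,000,000,000 queries
```

Below that query volume, paying a bit more per query is the cheaper option; above it, the one-time training investment starts to pay off.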
November 18, 2025 at 4:29 PM
My "The Building Blocks of Today’s and Tomorrow’s Language Models" talk at the PyTorch Conference is now up on YouTube! youtube.com/watch?v=nDl6...

The silver lining of my late arrival and rescheduling: there was no talk after mine, so it's followed by a 30-min Q&A instead of just the usual 5 :)
The Building Blocks of Today’s and Tomorrow’s Language Models - Sebastian Raschka, RAIR Lab
YouTube video by PyTorch
youtube.com
November 8, 2025 at 2:01 PM
I just saw the Kimi K2 Thinking release!

Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.

In short, Kimi K2 is a slightly scaled-up DeepSeek V3/R1, and the gains are in the data and training recipes. Hopefully, we will see some details on those soon, too.
November 6, 2025 at 7:35 PM
My new field guide to alternatives to standard LLMs:

Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers.

🔗 magazine.sebastianraschka.com/p/beyond-sta...
November 4, 2025 at 2:49 PM
Just saw the benchmarks of the new open-weight MiniMax-M2 LLM, and the performance is too good to ignore :). So, I just amended my "The Big LLM Architecture Comparison" with entry number 13!

Link to the full article: magazine.sebastianraschka.com/p/the-big-ll...
October 28, 2025 at 4:48 PM
A short talk on the main architecture components of LLMs this year + a look beyond the transformer architecture: www.youtube.com/watch?v=lONy...
October 27, 2025 at 3:45 PM
🔗 Mixture of Experts (MoE): github.com/rasbt/LLMs-f...
October 20, 2025 at 1:48 PM
Chapter 3, and with it the first 176 pages, is now live! (mng.bz/lZ5B)
October 16, 2025 at 1:35 PM
Sliding Window Attention
🔗 github.com/rasbt/LLMs-f...
October 13, 2025 at 1:51 PM
Multi-Head Latent Attention
🔗 github.com/rasbt/LLMs-f...
October 12, 2025 at 1:57 PM
Just a bit of weekend coding fun: A memory estimator to calculate the savings when using grouped-query attention vs multi-head attention (+ code implementations of course).

🔗 github.com/rasbt/LLMs-f...

Will add this for multi-head latent, sliding, and sparse attention as well.
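For a rough idea of the kind of estimate involved (a simplified sketch, not the code from the repo): the KV cache scales with the number of key/value heads, which is exactly where GQA saves memory over MHA:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes bf16/fp16
    n_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return n_bytes / 1024**3

# Illustrative (made-up) config: 32 layers, head_dim 128, 32k context
mha = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)  # MHA: one KV head per query head
gqa = kv_cache_gib(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=32_768)  # GQA: 8 shared KV head groups
print(f"MHA: {mha:.1f} GiB, GQA: {gqa:.1f} GiB")  # MHA: 16.0 GiB, GQA: 4.0 GiB
```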
October 11, 2025 at 1:46 PM
Updated & turned my Big LLM Architecture Comparison article into a video lecture.

The 11 LLM archs covered in this video:
1. DeepSeek V3/R1
2. OLMo 2
3. Gemma 3
4. Mistral Small 3.1
5. Llama 4
6. Qwen3
7. SmolLM3
8. Kimi K2
9. GPT-OSS
10. Grok 2.5
11. GLM-4.5/4.6

www.youtube.com/watch?v=rNlU...
The Big LLM Architecture Comparison
YouTube video by Sebastian Raschka
www.youtube.com
October 10, 2025 at 5:05 PM
From the Hierarchical Reasoning Model (HRM) to a new Tiny Recursive Model (TRM).

A few months ago, the HRM made big waves in the AI research community as it showed really good performance on the ARC challenge despite its small size of only 27M parameters. (That's about 22x smaller than the smallest Qwen3 model, Qwen3 0.6B.)
October 9, 2025 at 4:23 PM
It only took 13 years, but dark mode is finally here
sebastianraschka.com/blog/2021/dl...
October 8, 2025 at 1:50 AM