Also blogging about AI research at magazine.sebastianraschka.com.
...it has grown quite a bit since the initial version in July 2025, more than doubling in length!
magazine.sebastianraschka.com/p/the-big-ll...
magazine.sebastianraschka.com/p/state-of-l...
I planned to write just a brief overview, but yeah, it was an eventful year, so it was impossible to keep it below 7,000 words :D
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
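To make the contrast concrete, here is a minimal sketch (my own, not from the paper) of the two update patterns: gradient accumulation over several micro-batches versus simply taking an optimizer step per micro-batch. Model, data, and hyperparameters are placeholders.

```python
# Minimal sketch (not from the paper): gradient accumulation vs. plain small-batch steps.
# Model, data, and hyperparameters are placeholders.
import torch

model = torch.nn.Linear(512, 512)           # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def micro_batches(n, bs=1):
    for _ in range(n):
        x = torch.randn(bs, 512)
        yield x, x                          # dummy inputs/targets

# (a) Gradient accumulation: 8 micro-batches, one optimizer step (emulates a large batch)
opt.zero_grad()
for x, y in micro_batches(8):
    (loss_fn(model(x), y) / 8).backward()   # scale so gradients match the large-batch average
opt.step()

# (b) Small-batch training: 8 micro-batches, 8 optimizer steps
for x, y in micro_batches(8):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```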
"Small Batch Size Training for Language Models:
When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.
202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?
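For the 2025 entry, the core of GRPO is the group-relative advantage: rewards for a group of sampled completions of the same prompt are normalized against each other, so no separate value network is needed. A tiny sketch with made-up rewards:

```python
# Sketch of GRPO's group-relative advantage; rewards are made up
# (in RLVR they would come from a verifier, e.g. an answer checker).
import torch

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # G = 8 completions of one prompt
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)    # normalize within the group
print(advantages)  # each completion is scored relative to its own group; no value network as in PPO
```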
Just went through the config files; the only difference I could see is that Mistral 3 Large used 2x fewer experts but made each expert 2x larger.
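For intuition (with hypothetical numbers, not Mistral's actual config): halving the expert count while doubling each expert's width leaves the total MoE parameter count roughly unchanged; what changes is the routing granularity.

```python
# Hypothetical numbers, only to illustrate the trade-off (not Mistral's actual config):
def moe_ffn_params(d_model, d_ff, n_experts):
    return n_experts * 3 * d_model * d_ff   # SwiGLU expert: gate, up, and down projections

many_small = moe_ffn_params(d_model=4096, d_ff=2048, n_experts=128)
fewer_large = moe_ffn_params(d_model=4096, d_ff=4096, n_experts=64)   # 2x fewer, 2x larger experts
print(many_small == fewer_large)  # True: same total parameters, coarser expert granularity
```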
From Scratch to Scale: How Far Python Takes You in Building LLMs
A deep dive into how Python powers experimentation, training, and scaling of LLMs.
📍 Darmstadt, April 14–16, 2026
I just wrote up a technical tour of the predecessors and components that led up to this:
🔗 magazine.sebastianraschka.com/p/technical-...
- Multi-Head Latent Attention
- RLVR
- Sparse Attention
- Self-Verification
- GRPO Updates
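As a taste of the first item in the list, here is a minimal sketch of the multi-head latent attention idea as I understand it: keys and values are reconstructed from a small shared latent, and only that latent needs to be cached (the decoupled RoPE path is omitted).

```python
# Minimal sketch of the KV-compression idea in multi-head latent attention
# (not the full DeepSeek implementation; the decoupled RoPE path is omitted).
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection; this latent is what gets cached
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to values
W_q   = nn.Linear(d_model, n_heads * d_head, bias=False)

x = torch.randn(1, 16, d_model)                            # (batch, seq_len, d_model)
c_kv = W_dkv(x)                                            # compressed KV latent, much smaller than full K/V
q = W_q(x).view(1, 16, n_heads, d_head).transpose(1, 2)
k = W_uk(c_kv).view(1, 16, n_heads, d_head).transpose(1, 2)
v = W_uv(c_kv).view(1, 16, n_heads, d_head).transpose(1, 2)

out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```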
Basically pushes RLVR & self-refinement to gold-level scores on IMO 2025.
Coincidentally, I am currently working on a chapter on self-refinement, and this comes in handy as a nice, scaled-up case study.
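For context, a generic self-refinement loop looks roughly like this; `generate` and `verify` are hypothetical placeholders for an LLM call and a checker, not the paper's exact method.

```python
# Generic generate -> verify -> refine loop (a sketch, not the paper's exact method);
# `generate` and `verify` are hypothetical placeholders for an LLM call and a verifier.
def self_refine(problem, generate, verify, max_rounds=4):
    attempt = generate(problem)
    for _ in range(max_rounds):
        ok, feedback = verify(problem, attempt)
        if ok:
            return attempt
        # condition the next attempt on the previous attempt and the verifier feedback
        attempt = generate(problem, previous=attempt, feedback=feedback)
    return attempt
```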
If you are interested in reading through the architecture details, I coded it from scratch here: github.com/rasbt/LLMs-f...
If you are looking for something to read this weekend, Ch4 is available now: mng.bz/Dwra
Training is usually very, very expensive, but it is a one-time cost. Inference-scaling is comparatively cheap, but it's a cost we pay at each query.
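With made-up numbers, the amortization argument looks like this: the one-time training cost gets diluted over queries, while any extra inference-time compute scales linearly with query volume.

```python
# Made-up costs, only to illustrate the amortization argument.
train_cost = 50_000_000        # one-time training cost (hypothetical, in dollars)
base_query_cost = 0.002        # per query without extra test-time compute (hypothetical)
scaled_query_cost = 0.02       # per query with ~10x inference-time compute (hypothetical)

for n_queries in (1e6, 1e9, 1e12):
    base = train_cost + n_queries * base_query_cost
    scaled = train_cost + n_queries * scaled_query_cost
    print(f"{n_queries:.0e} queries: ${base:,.0f} vs. ${scaled:,.0f} with inference-time scaling")
```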
The silver lining of my late arrival and rescheduling: there was no talk after mine, so it was followed by a 30-minute Q&A instead of just the usual 5 minutes :)
Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.
In short, Kimi K2 is a slightly scaled-up DeepSeek V3/R1, and the gains are in the data and training recipes. Hopefully, we will see some details on those soon, too.
Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers.
🔗 magazine.sebastianraschka.com/p/beyond-sta...
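For reference, the gated delta rule behind these hybrids combines a per-token decay gate with a delta-rule write into a matrix-valued state. A naive per-token sketch (as I understand the formulation; real implementations use a chunked parallel scan instead of this loop):

```python
# Naive per-token sketch of the gated delta rule (simplified shapes, no chunked scan).
import torch

d_k, d_v, T = 64, 64, 16
S = torch.zeros(d_v, d_k)                                    # recurrent matrix-valued state ("fast weights")
q = torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1)
k = torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1)
v = torch.randn(T, d_v)
alpha = torch.rand(T)                                        # per-token decay gate in (0, 1)
beta = torch.rand(T)                                         # per-token write strength

outputs = []
for t in range(T):
    decayed = alpha[t] * S @ (torch.eye(d_k) - beta[t] * torch.outer(k[t], k[t]))  # decay + erase along k_t
    S = decayed + beta[t] * torch.outer(v[t], k[t])                                # write the new (k_t, v_t) association
    outputs.append(S @ q[t])                                                       # read out with the query
out = torch.stack(outputs)                                   # (T, d_v)
print(out.shape)
```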
Link to the full article: magazine.sebastianraschka.com/p/the-big-ll...
Will add this for multi-head latent, sliding, and sparse attention as well.
🔗 github.com/rasbt/LLMs-f...
The 11 LLM archs covered in this video:
1. DeepSeek V3/R1
2. OLMo 2
3. Gemma 3
4. Mistral Small 3.1
5. Llama 4
6. Qwen3
7. SmolLM3
8. Kimi K2
9. GPT-OSS
10. Grok 2.5
11. GLM-4.5/4.6
www.youtube.com/watch?v=rNlU...
A few months ago, the HRM (Hierarchical Reasoning Model) made big waves in the AI research community as it showed really good performance on the ARC challenge despite its small 27M-parameter size. (That's about 22x smaller than the smallest Qwen3 0.6B model.)
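Quick sanity check of that ratio, using the parameter counts mentioned in the post:

```python
# Parameter counts as stated in the post
print(0.6e9 / 27e6)   # ~22.2, i.e., roughly 22x smaller
```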
sebastianraschka.com/blog/2021/dl...