Also blogging about AI research at magazine.sebastianraschka.com.
I wrote up a new article on
(1) multiple-choice benchmarks,
(2) verifiers,
(3) leaderboards, and
(4) LLM judges
All with from-scratch code examples, of course!
sebastianraschka.com/blog/2025/ll...
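To give a flavor of the first category, here is a minimal sketch (not the article's exact code; the extraction regex and toy data are made up for illustration) of how multiple-choice benchmark accuracy can be scored:

```python
import re

def extract_choice(model_output: str, choices: str = "ABCD") -> str:
    """Return the first standalone choice letter found in the model output."""
    match = re.search(rf"\b([{choices}])\b", model_output.upper())
    return match.group(1) if match else ""  # no letter found counts as incorrect

def mc_accuracy(model_outputs, gold_answers) -> float:
    """Fraction of questions where the extracted letter matches the gold answer."""
    correct = sum(
        extract_choice(out) == gold for out, gold in zip(model_outputs, gold_answers)
    )
    return correct / len(gold_answers)

# Toy example
outputs = ["The answer is B.", "C", "I think (A) is correct", "D) because ..."]
golds = ["B", "C", "A", "B"]
print(mc_accuracy(outputs, golds))  # 0.75
```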
The silver lining of my late arrival and rescheduling: there was no talk after mine, so it's followed by a 30-min Q&A instead of just the usual 5 :)
Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.
In short, Kimi K2 is a slightly scaled-up DeepSeek V3/R1; the gains come from the data and training recipes. Hopefully, we will see some details on those soon, too.
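For a rough side-by-side in code, here is a small comparison sketch; the numbers are from memory of the published configs and may be slightly off, so treat them as illustrative rather than authoritative:

```python
# Approximate architecture settings (illustrative, not official values)
deepseek_v3 = {
    "total_params": "671B",
    "active_params_per_token": "~37B",
    "n_layers": 61,
    "attention": "MLA",        # multi-head latent attention
    "n_attention_heads": 128,
    "n_routed_experts": 256,
    "n_active_experts": 8,
}

kimi_k2 = {
    "total_params": "~1T",
    "active_params_per_token": "~32B",
    "n_layers": 61,
    "attention": "MLA",        # same attention mechanism
    "n_attention_heads": 64,   # fewer attention heads
    "n_routed_experts": 384,   # more (and therefore smaller-share) experts
    "n_active_experts": 8,
}

# Print only the settings that differ
for key in deepseek_v3:
    if deepseek_v3[key] != kimi_k2[key]:
        print(f"{key}: DeepSeek V3/R1 = {deepseek_v3[key]}, Kimi K2 = {kimi_k2[key]}")
```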
Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers.
🔗 magazine.sebastianraschka.com/p/beyond-sta...
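As a taste of the first topic, here is a minimal, unoptimized sketch of the gated delta rule recurrence that these hybrids build on (a decayed linear-attention state updated with a delta-rule correction); shapes and gating inputs are simplified for illustration, and in real models alpha and beta come from learned projections of the input:

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """
    Naive sequential sketch of the gated delta rule.
    q, k, v: (seq_len, d); alpha, beta: (seq_len,) gates in (0, 1).
    """
    seq_len, d = q.shape
    S = torch.zeros(d, d)  # recurrent "fast weight" state
    outputs = []
    for t in range(seq_len):
        k_t = torch.nn.functional.normalize(k[t], dim=-1)
        # decay the state and remove the old association for k_t (delta rule)
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k_t, k_t))
        # write the new value associated with k_t
        S = S + beta[t] * torch.outer(v[t], k_t)
        outputs.append(S @ q[t])
    return torch.stack(outputs)

T, d = 16, 32
out = gated_delta_rule(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
                       alpha=torch.rand(T), beta=torch.rand(T))
print(out.shape)  # torch.Size([16, 32])
```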
Link to the full article: magazine.sebastianraschka.com/p/the-big-ll...
🔗 github.com/rasbt/LLMs-f...
Will add this for multi-head latent, sliding, and sparse attention as well.
🔗 github.com/rasbt/LLMs-f...
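For context, the sliding-window variant mainly comes down to restricting the causal mask; here is a minimal sketch (not the repo's actual code) of such a mask:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """
    Boolean mask where True marks positions a query may attend to:
    causal (no future tokens) and limited to the last `window_size` tokens.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window_size)

print(sliding_window_causal_mask(seq_len=5, window_size=2).int())
```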
The 11 LLM architectures covered in this video:
1. DeepSeek V3/R1
2. OLMo 2
3. Gemma 3
4. Mistral Small 3.1
5. Llama 4
6. Qwen3
7. SmolLM3
8. Kimi K2
9. GPT-OSS
10. Grok 2.5
11. GLM-4.5/4.6
www.youtube.com/watch?v=rNlU...
A few months ago, the Hierarchical Reasoning Model (HRM) made big waves in the AI research community because it showed really good performance on the ARC challenge despite having only 27M parameters. (That's about 22x smaller than the smallest Qwen3 model, Qwen3 0.6B.)
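The core idea is a nested recurrence: a fast low-level module takes several steps for every single update of a slow high-level module. A rough sketch of that loop, with GRU cells standing in for the paper's actual modules and all sizes chosen arbitrarily:

```python
import torch
import torch.nn as nn

class TinyHRMSketch(nn.Module):
    """Sketch of the hierarchical recurrence idea, not the paper's exact architecture."""
    def __init__(self, dim=128, low_steps=4, high_cycles=2):
        super().__init__()
        self.low = nn.GRUCell(dim, dim)    # stand-in for the low-level module
        self.high = nn.GRUCell(dim, dim)   # stand-in for the high-level module
        self.low_steps = low_steps
        self.high_cycles = high_cycles

    def forward(self, x):
        z_low = torch.zeros_like(x)
        z_high = torch.zeros_like(x)
        for _ in range(self.high_cycles):
            for _ in range(self.low_steps):
                # the low-level state is refined conditioned on input + high-level state
                z_low = self.low(x + z_high, z_low)
            # the high-level state is updated once per cycle from the low-level result
            z_high = self.high(z_low, z_high)
        return z_high

out = TinyHRMSketch()(torch.randn(8, 128))  # batch of 8 input states
print(out.shape)  # torch.Size([8, 128])
```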
sebastianraschka.com/blog/2021/dl...
If you are new to reinforcement learning, this article has a generous intro section (PPO, GRPO, etc.).
Also, I cover 15 recent articles focused on RL & reasoning.
🔗 magazine.sebastianraschka.com/p/the-state-...
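As a small preview, the core of GRPO is the group-relative advantage: each sampled response is scored against its own group's mean and standard deviation, so no critic network is needed. A minimal sketch (toy rewards, not tied to any specific implementation):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """
    Group-relative advantages as used in GRPO.
    rewards: (num_groups, group_size), one row of rewards per prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: one prompt, 4 sampled answers, only the last two are correct
print(grpo_advantages(torch.tensor([[0.0, 0.0, 1.0, 1.0]])))
```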
Why? Because I think 1B & 3B models are great for experimentation, and I wanted to share a clean, readable implementation for learning and research: huggingface.co/rasbt/llama-...
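The repo itself isn't reproduced here, but as an example of the kind of building block a clean Llama-style implementation contains, here is a minimal RMSNorm sketch (the eps value is a common choice, not necessarily the exact one used):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in Llama-style models: rescale by the reciprocal RMS, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

print(RMSNorm(64)(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```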
In this 1 h 45 min hands-on coding session, I go over implementing the GPT architecture, the foundation of modern LLMs (and I also have bonus material on converting it to Llama 3.2): www.youtube.com/watch?v=YSAk...
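For reference, the building block the session revolves around looks roughly like this; a sketch of a generic pre-LayerNorm GPT block, not the video's exact code:

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """Minimal GPT-style block: pre-LayerNorm, causal multi-head attention, MLP, residuals."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal = future positions are masked out
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.ln2(x))

print(GPTBlock()(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```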
self-attention → parameterized self-attention → causal self-attention → multi-head self-attention
www.youtube.com/watch?v=-Ll8...
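As a quick illustration of the "parameterized + causal" steps in that progression (single head, no batching, toy dimensions):

```python
import torch

def causal_self_attention(x, W_q, W_k, W_v):
    """Learned q/k/v projections plus a mask so each token only attends to itself and earlier tokens."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    seq_len = x.shape[0]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

d_in, d_out = 8, 4
x = torch.randn(6, d_in)                        # 6 tokens
params = [torch.randn(d_in, d_out) for _ in range(3)]
print(causal_self_attention(x, *params).shape)  # torch.Size([6, 4])
```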
Happy reading!
- Python & PyTorch still dominate
- 80%+ use NVIDIA GPUs, but no multi-node setups 🤔
- LoRA still popular for training efficiency, but full finetuning gains traction.
Surprisingly, CNNs still lead in CV comps
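For the LoRA point above, a minimal sketch of why it's cheap: the pretrained weight stays frozen and only two small low-rank matrices receive gradients (layer sizes and rank here are arbitrary):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable LoRA params vs ~262k frozen base weights
```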
- Python & PyTorch still dominate
- 80%+ use NVIDIA GPUs, but no multi-node setups 🤔
- LoRA still popular for training efficiency, but full finetuning gains traction.
Surprisingly, CNNs still lead in CV comps
A learning exercise, and I am so jealous: working through the book _Build a Large Language Model (From Scratch)_ by Sebastian Raschka.
First post in the series here:
www.gilesthomas.com/2024/12/llm-...