Sebastian Raschka (rasbt)
sebastianraschka.com
@sebastianraschka.com
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D) & reasoning (https://mng.bz/Nwr7).

Also blogging about AI research at magazine.sebastianraschka.com.
Awesome, I am glad to hear this!
November 11, 2025 at 10:14 PM
Happy reading and coding! Regarding the calculus part, I do have something here :)

sebastianraschka.com/pdf/books/dl...
November 11, 2025 at 1:10 PM
currently halfway done with chapter 4 😁
November 5, 2025 at 11:32 PM
I like trying new and different things. I don’t have as many convos here, but those I had were quite insightful, so I am hoping for more of that!
November 5, 2025 at 3:25 PM
haha, you never know on the internet these days. But joking aside, I did update the article with MiniMax M2 and Kimi Linear last week :)
magazine.sebastianraschka.com/p/the-big-ll...
The Big LLM Architecture Comparison
From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design
November 2, 2025 at 2:09 AM
Had that on my list for this wknd!
October 31, 2025 at 2:19 PM
This is from a PyTorch developer’s perspective. For that, it’s great so far!
October 31, 2025 at 2:18 PM
Kimi came 2 days later 😆
October 31, 2025 at 2:17 PM
ha, very timely! Just got back from the conference and haven't had a chance to read the M2 report. But based on the Model Hub, it seems that SWA is not the default (similar to the recent Mistral models) 🤔
(Source: huggingface.co/MiniMaxAI/Mi...)
October 27, 2025 at 6:11 PM
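For context, sliding-window attention (SWA) restricts each token to attending only to the last few positions rather than the full causal prefix. A minimal illustrative sketch of the mask it implies (the function name and boolean-list representation are my own, not taken from any model's actual code):

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask.

    mask[i][j] is True when token i may attend to token j:
    only earlier (or current) tokens within the last `window`
    positions, i.e. j in the half-open range (i - window, i].
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `window` equal to the sequence length this reduces to an ordinary causal mask, which is why SWA is a drop-in restriction of full attention rather than a different mechanism.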
there are trade-offs, but I find it refreshing that people are working on this :)
October 27, 2025 at 3:53 PM
You could, but I would not recommend it 😅
bsky.app/profile/seba...
October 11, 2025 at 5:16 PM
Yes, if you set n_kv_groups = n_heads then it's multi-query attention. But it's not recommended, as it is too extreme a case and results in poor modeling performance. I don't know of any LLM using it.
October 11, 2025 at 2:04 PM
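A toy sketch of what that setting controls, assuming the convention above where n_kv_groups is the number of query heads sharing one K/V head (so n_kv_groups = 1 is regular multi-head attention and n_kv_groups = n_heads is multi-query attention). The function name and shapes are illustrative, not code from the book:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_groups):
    """Toy grouped-query attention (no masking, no projections).

    q: (n_heads, seq, d) query heads.
    k, v: (n_heads // n_kv_groups, seq, d) shared key/value heads.
    n_kv_groups == 1        -> multi-head attention (no sharing)
    n_kv_groups == n_heads  -> multi-query attention (one K/V head for all)
    """
    n_heads, seq, d = q.shape
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // n_kv_groups  # index of the K/V head shared by this group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        # numerically stable softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out
```

The memory savings come from the smaller K/V tensors (and KV cache); the extreme n_kv_groups = n_heads case shrinks them the most, which is exactly why it hurts modeling quality.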
Keep in mind that they are special-purpose models, not general text models. But that's also the whole selling point: small special-purpose models instead of LLMs.
October 9, 2025 at 6:09 PM
Interesting! I am not sure it will be a concatenation of models (I can't see how that would work), but I can see modules like this being used as tools that a generalist model calls for specific problems.
October 9, 2025 at 5:23 PM