jacobaustin123.bsky.social (@jacobaustin123.bsky.social)
Researcher at Google DeepMind. I make LLMs go fast. I also play piano and climb sometimes. Opinions my own
The rest of the book is a set of practical guides: how to write and profile parallel JAX code, and how to apply the previous two sections to real models like LLaMA-3. We also have worked problems at the end of each section if you like homework: jax-ml.github.io/scaling-book... 8/n
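For a taste of the profiling workflow, here is a minimal sketch (my own toy example, not code from the book; the matmul and the "/tmp/profile" path are arbitrary). jax.profiler.trace records a trace you can open in TensorBoard or Perfetto:

```python
import jax
import jax.numpy as jnp

x = jnp.ones((8192, 8192))
f = jax.jit(lambda a: a @ a)
f(x).block_until_ready()  # warm up: compile outside the trace

# Everything inside this context gets recorded to the trace directory.
with jax.profiler.trace("/tmp/profile"):
    f(x).block_until_ready()  # block so the work lands inside the trace
```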
Now that we’ve talked about training, we need to talk about serving. How expensive should a model be to serve? What kind of latency can we expect? What are prefill and generation? How do we build an efficient inference service? We talk about this here: jax-ml.github.io/scaling-book... 7/n
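One back-of-envelope in that spirit (my numbers, not the book's): during generation, every new token has to stream all the weights from HBM, so memory bandwidth sets a hard floor on per-token latency. Assuming an 8B-param model in bf16 and roughly 8.1e11 bytes/s of HBM bandwidth per chip:

```python
# Hedged latency floor for decode: per-token latency >= param bytes / HBM bandwidth.
param_bytes = 8e9 * 2      # 8B params in bf16 (assumed model size)
hbm_bandwidth = 8.1e11     # bytes/s, assumed per-chip HBM bandwidth
print(f">= {param_bytes / hbm_bandwidth * 1e3:.1f} ms/token")  # ~19.8 ms
```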
Now for the good stuff! You may have heard of data or tensor parallelism, FSDP or pipelining. But why choose one over the other? Short answer: each adds communication, and the one with the lowest cost depends on the model. Part 5 dives into this: jax-ml.github.io/scaling-book... 6/n
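To give the flavor of that accounting (all numbers assumed for illustration, not quoted from the book): pure data parallelism pays a gradient AllReduce every step, and a ring AllReduce moves roughly 2x the gradient bytes through each chip regardless of chip count.

```python
# Hedged back-of-envelope: communication time per step for pure data parallelism.
params = 8e9                 # assumed model size
bytes_per_param = 2          # bf16 gradients
ici_bandwidth = 1e11         # bytes/s per chip, an assumed interconnect figure

comm_bytes = 2 * params * bytes_per_param  # ring AllReduce moves ~2x the data
print(f"~{comm_bytes / ici_bandwidth * 1e3:.0f} ms of comms per step")  # ~320 ms
```

Whether those milliseconds hide behind compute or dominate the step is exactly the kind of question Part 5 works through.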
5 years ago, there were many ML architectures, but today, there is (mostly) only one. _You should know the Transformer inside and out!_ How many FLOPs or params in LLaMA-3? How expensive is attention vs. a feed-forward block? You'll know after reading jax-ml.github.io/scaling-book... 5/n
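Here is the flavor of that arithmetic for a LLaMA-3-8B-style config (dimensions from the published architecture; the counting is my back-of-envelope, not the book's exact derivation):

```python
# Rough parameter count for a LLaMA-3-8B-like Transformer.
d_model, d_ff, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 128256

attn = d_model * head_dim * (2 * n_heads + 2 * n_kv_heads)  # Q and O full, K and V grouped
mlp = 3 * d_model * d_ff                                    # gate, up, and down projections
embed = 2 * vocab * d_model                                 # untied embedding + unembedding
params = n_layers * (attn + mlp) + embed
print(f"{params / 1e9:.1f}B params")                # ~8.0B
print(f"~{2 * params / 1e9:.0f} GFLOPs per token")  # forward pass ~= 2 * params FLOPs/token
```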
Scaling an LLM involves distributing — a.k.a. "sharding" — its weights across multiple TPUs. To run it, we have to add cross-chip communication. Part 3 describes the TPU's communication primitives, and simple rules for multiplying sharded matrices: jax-ml.github.io/scaling-book... 4/n
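Here is a minimal sketch of one such rule in JAX (shapes are arbitrary; assumes a host with at least two devices): shard both operands of a matmul along the contracting dimension, and the compiler inserts an AllReduce to sum the per-chip partial products.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis "x" over all local devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

# Shard both operands along the contracting (inner) dimension: each chip
# holds a slice of it and computes a partial matmul.
A = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "x")))
W = jax.device_put(jnp.ones((4096, 8192)), NamedSharding(mesh, P("x", None)))

out = jax.jit(jnp.dot)(A, W)  # XLA adds an AllReduce over the partial products
print(out.sharding)           # the summed output ends up replicated
```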
A big chunk of this book is dedicated to understanding the hardware that provides those system resources. We emphasize TPUs in this book, but the principles and math can be adapted to GPUs too. Part 2 explains the TPU in detail: jax-ml.github.io/scaling-book... 3/n
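The core of that math is the roofline model. A hedged example (TPU v5e specs quoted from memory; see the chapter for exact numbers): an op only saturates the compute units if its arithmetic intensity, i.e. FLOPs per byte moved, beats the hardware's FLOPs-to-bandwidth ratio.

```python
# Roofline sketch with assumed TPU v5e figures.
peak_flops = 1.97e14     # bf16 FLOP/s
hbm_bandwidth = 8.1e11   # bytes/s of HBM bandwidth

# Above this intensity the op is compute-bound; below it, memory-bound.
critical_intensity = peak_flops / hbm_bandwidth
print(f"compute-bound above ~{critical_intensity:.0f} FLOPs/byte")  # ~243
```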
Making LLMs run efficiently can feel scary, but scaling isn’t magic: it’s math! We wanted to demystify the “systems view” of LLMs, so we wrote a little textbook called “How To Scale Your Model”, which we’re releasing today. 1/n
February 4, 2025 at 6:54 PM