llm-d
@llm-d.ai
llm-d.ai
llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale.

Learn more at: https://llm-d.ai
Want to learn more about open-source distributed inference? 🚀

Join contributors from vLLM and llm-d at NVIDIA Dynamo Day to see how the community is building the future of distributed inference.

📍 Virtual & Free 📅 Jan 22 | 8AM–1PM PT 🔗 nvevents.nvidia.com/dynamoday
Dynamo Day
nvevents.nvidia.com
January 15, 2026 at 6:14 PM
Reposted by llm-d
If you see me around the hallway or at the sessions, I’d love to chat about:
- Model inference (KServe, vLLM, @llm-d.ai)
- @kubernetes.io AI Conformance Program
- @kubefloworg.bsky.social & @argoproj.bsky.social
- @cncf.io TAG Workloads Foundation
- Open source, cloud-native, AI infra and systems
January 15, 2026 at 5:06 PM
Reposted by llm-d
Excited to share that I'll be speaking at #KubeCon Europe in Amsterdam! You can find me in the following sessions:
1. Cloud Native AI + Kubeflow Day: Welcome + Opening Remarks: https://sched.co/2DZN3
2. Project Lightning Talk: Evolving KServe: https://sched.co/2EFyW
January 15, 2026 at 5:05 PM
Reposted by llm-d
📢 The State of Model Serving Communities: January Edition is out!

We launched our newsletter publicly last year to share our Red Hat AI teams' contributions to upstream communities. We've gained over 1,200 subscribers!
January 13, 2026 at 12:23 AM
Standardizing high-performance inference requires deep ecosystem collaboration. 🚀

Huge shoutout to @vllm_project and @IBMResearch on the new KV Offloading Connector. We’re seeing up to 9x throughput gains on H100s and massive TTFT reductions. 🧵

blog.vllm.ai/2026/01/08/k...
Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput
In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall inference…
blog.vllm.ai
January 9, 2026 at 6:45 PM
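For readers who want the intuition behind the offloading post above: KV blocks evicted from GPU memory land in a CPU (DRAM) tier and are restored on a later prefix hit instead of being recomputed. The sketch below is illustrative only and does not use vLLM's actual connector API; the block size, hashing scheme, and `CPUOffloadStore` class are assumptions meant to show the eviction/restore tier in miniature.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV block; illustrative, not vLLM's actual setting


def block_keys(token_ids):
    """Hash fixed-size chunks of a prompt; chaining makes each key encode its full prefix."""
    keys, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        prev = hashlib.sha256(prev + bytes(str(token_ids[i:i + BLOCK_SIZE]), "utf-8")).digest()
        keys.append(prev)
    return keys


class CPUOffloadStore:
    """Toy DRAM tier: GPU evictions land here; later prefix hits restore blocks instead of recomputing."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # key -> KV block payload, kept in LRU order

    def offload(self, key, kv_block):
        self.blocks[key] = kv_block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block

    def restore(self, keys):
        """Return the longest run of leading blocks that can be served from DRAM."""
        restored = []
        for key in keys:
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)
            restored.append(self.blocks[key])
        return restored
```

Restoring a block from DRAM is far cheaper than re-running prefill for its tokens, which is where the TTFT and throughput gains described in the post come from.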
AI inference is like a busy airport: without a controller, you get gridlock. ✈️

Check out this breakdown by Cedric Clyburn from Red Hat on how llm-d intelligently routes distributed LLM requests.

🔹 Solves "round robin" congestion
🔹 Disaggregates P/D to save costs

www.youtube.com/watch?v=CNKG...
LLM‑D Explained: Building Next‑Gen AI with LLMs, RAG & Kubernetes
YouTube video by IBM Technology
www.youtube.com
January 8, 2026 at 7:21 PM
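The video above touches on prefill/decode (P/D) disaggregation. As a rough, hypothetical sketch of the idea (not llm-d's actual scheduler), the snippet below routes compute-bound prefill work and latency-sensitive decode work to separately tuned pools, so long prompts don't stall interactive token generation on the same GPU:

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int    # processed once up front: compute-bound prefill
    max_new_tokens: int   # generated one at a time: memory-bandwidth-bound decode


# Hypothetical pools: prefill replicas are sized for batched FLOPs,
# decode replicas for KV-cache residency and low per-token latency.
PREFILL_POOL = ["prefill-0", "prefill-1"]
DECODE_POOL = ["decode-0", "decode-1", "decode-2"]

# Toy load trackers that a real system would feed from replica metrics.
pending_prefill_tokens = {r: 0 for r in PREFILL_POOL}
active_sequences = {r: 0 for r in DECODE_POOL}


def schedule(req: Request):
    """Send each phase to the least-loaded replica in the pool tuned for that phase."""
    prefill_target = min(PREFILL_POOL, key=lambda r: pending_prefill_tokens[r])
    decode_target = min(DECODE_POOL, key=lambda r: active_sequences[r])
    pending_prefill_tokens[prefill_target] += req.prompt_tokens
    active_sequences[decode_target] += 1
    return prefill_target, decode_target


p, d = schedule(Request(prompt_tokens=4096, max_new_tokens=256))
print(f"prefill on {p}, then the KV cache is handed off to {d} for decode")
```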
Instead of naive load balancing hitting cold replicas in multi-turn chats, llm-d ensures context is reused.

This demo shows a near 90% KV cache hit rate, a smoother time to first token, and a ~500ms drop in P95 tail latency.

https://www.youtube.com/watch?v=H2N4c-E-iw8
Unlock 90% KV Cache Hit Rates with llm-d Intelligent Routing
www.youtube.com
January 12, 2026 at 3:16 PM
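A minimal sketch of the prefix-affinity idea the demo above is showing. This is not the actual llm-d scheduler; the scoring weights, `prefix_key` helper, and cache-tracking dictionaries are assumptions, included only to illustrate how a router can prefer the replica that already holds a conversation's KV cache while still penalizing overloaded pods.

```python
import hashlib


def prefix_key(messages, depth=2):
    """Key a chat by its first few turns so every follow-up request maps to the same value."""
    head = "".join(m["content"] for m in messages[:depth])
    return hashlib.sha256(head.encode()).hexdigest()


def pick_replica(messages, replicas, cache_index, queue_depth, cache_weight=0.7):
    """Score = cache affinity minus a load penalty; the highest-scoring replica wins.

    cache_index: replica -> set of prefix keys believed to be resident (hypothetical tracker)
    queue_depth: replica -> number of requests currently queued on that replica
    """
    key = prefix_key(messages)
    max_queue = max(queue_depth.values()) or 1

    def score(replica):
        affinity = 1.0 if key in cache_index.get(replica, set()) else 0.0
        load_penalty = queue_depth[replica] / max_queue
        return cache_weight * affinity - (1 - cache_weight) * load_penalty

    best = max(replicas, key=score)
    cache_index.setdefault(best, set()).add(key)  # the chosen replica now holds this prefix
    return best
```

Because every follow-up in a multi-turn chat carries the same opening messages, `prefix_key` stays stable and the conversation keeps landing on a warm replica; that reuse is what drives hit rates and TTFT improvements like the ones in the demo.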
Want to work on llm-d and vLLM as a full time job?
@RedHat_AI is hiring for a variety of roles around open-source LLM inference. https://x.com/RedHat_AI/status/2001362060777586744
January 12, 2026 at 3:16 PM
Want to understand the architecture under the hood of llm-d? ⚙️

We’ve curated our recent technical deep dives and talks from KubeCon, PyTorch Conf, and more into one central hub.

Learn Kubernetes-native distributed inference from the source. 🧵👇

https://llm-d.ai/videos
Videos | llm-d
Watch videos about llm-d: a Kubernetes-native high-performance distributed LLM inference framework
llm-d.ai
January 12, 2026 at 3:16 PM
🚀 Announcing llm-d v0.4! This release focuses on achieving SOTA inference performance across accelerators. From ultra-low latency for MoE models to new auto-scaling capabilities, we’re pushing the boundaries of open-source inference. Blog: https://t.co/qlQnzcT9O3 🧵👇
January 12, 2026 at 3:16 PM
Learn more about the llm-d integration in KServe from the recent Cloud Native AI Day at KubeCon.

👇 https://x.com/TerryTangYuan/status/1992995298105290794
January 12, 2026 at 3:15 PM
Running massive models like DeepSeek-R1 requires serious distributed infrastructure.

Enter llm-d ⚡️

Join @RedHat_AI's Rob Shaw for a deep dive into this open-source framework for optimizing distributed LLM inference using a "well-lit paths" approach

👉 https://www.youtube.com/watch?v=_xAXb70d4-0
Distributed inference with llm-d’s “well-lit paths”
www.youtube.com
January 12, 2026 at 3:15 PM
🚀 llm-d v0.3.1 is LIVE! 🚀 This patch release is packed with key follow-ups from v0.3.0, including new hardware support, expanded cloud provider integration, and streamlined image builds. Dive into the full changelog: https://t.co/Wh6OGJ0KdO #llmd #OpenSource #vLLM #Release
January 12, 2026 at 3:15 PM
llm-d just passed 2,000 stars on GitHub!

⭐️ A Kubernetes-native distributed LLM inference framework built for performance and scalability.

Join the community today!

https://llm-d.ai/docs/community
Contributing to llm-d | llm-d
Guidelines for contributing to the llm-d project
llm-d.ai
January 12, 2026 at 3:15 PM
Going to #KubeCon Atlanta?

Join the llm-d community's sessions exploring how to route and scale LLM inference on Kubernetes.

From prefix-aware routing to multi-accelerator deployments - come learn what we've been building.

Schedule: https://llm-d.ai/docs/community/events
Upcoming llm-d Events | llm-d
Meet the llm-d community at upcoming talks, meetups, and conferences
llm-d.ai
January 12, 2026 at 3:15 PM
Register now to meet with and learn directly from the llm-d and vLLM core contributors! https://x.com/RedHat_AI/status/1983174562310451279
January 12, 2026 at 3:15 PM
We're gathering feedback on our new "Wide-EP LWS" deployment pattern using Kustomize.

Your input on how it compares to the previous Helmfiles approach is crucial for our v0.4 release cycle.

Please share your thoughts in our short form! 👇

📝 https://t.co/HGDIusHtBu
Feedback on New Wide-EP LWS Deployment Pattern (Kustomize vs. Helmfiles)
Please provide your feedback on the new Wide-EP LWS deployment pattern using Kustomize, compared to Helmfiles. Your input is crucial for the v0.4 release cycle. This form will capture feedback for at least 2 weeks. For additional details about this design change for llm-d, refer to the public "Improvements to the llm-d well lit path configurations" document. Join the llm-d-contributors Google Group for access to that document.
docs.google.com
January 12, 2026 at 3:15 PM
🚀 Evolving for Impact! We're updating our llm-d SIG meeting schedule to a bi-weekly cadence. This gives our community more time for deep work between calls, making our sessions even more focused and productive. Here are the details 👇
January 12, 2026 at 3:15 PM
We are thrilled to announce the release of llm-d v0.3! 🚀 This release is a huge milestone, powered by our incredible community, as we continue to build wider, well-lit paths for high-performance, hardware-agnostic, and scalable inference. 🧵Let's dive into what's new!
January 12, 2026 at 3:15 PM
Running LLMs on Kubernetes? You've likely felt the pain of re-processing the same context tokens over and over (think RAG system prompts). This is a huge source of inefficiency in distributed inference. Let's break down how we're solving this with llm-d. 🧵
January 12, 2026 at 3:15 PM
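To put a number on the inefficiency described above, here is a tiny, self-contained illustration (the token counts are made up): two RAG requests share the same long system prompt and retrieved context, so nearly all of the second request's prefill could be skipped if it reaches a replica that already cached that prefix.

```python
def shared_prefix_len(a_tokens, b_tokens):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a_tokens, b_tokens):
        if x != y:
            break
        n += 1
    return n


# Illustrative stand-in token ids: ~1,800 tokens of shared system prompt + retrieved context,
# followed by a short, unique user question per request.
SHARED_CONTEXT = list(range(1800))
req_a = SHARED_CONTEXT + [9001, 9002, 9003, 9004]
req_b = SHARED_CONTEXT + [7001, 7002, 7003, 7004, 7005]

shared = shared_prefix_len(req_a, req_b)
print(f"{shared}/{len(req_b)} tokens ({shared / len(req_b):.1%}) of the second request "
      f"need no prefill when the shared prefix is already cached")
```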
🗓️ Conference season is heating up, and the llm-d community is out in full force!

Want to learn about efficient, scalable LLM inference directly from the experts?

Here's where you can find us this month: 👇
January 12, 2026 at 3:15 PM
In production LLM inference, this metric matters: KV-Cache hit rate. Why? A cached token is up to 10x cheaper to process than an uncached one. But when you scale out, naive load balancing creates a costly disaster: the "heartbreaking KV-cache miss." https://red.ht/46A4ynW
January 12, 2026 at 3:15 PM
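Back-of-the-envelope on why the hit rate dominates prefill cost, taking the post's "up to 10x cheaper" figure as the only input; the token count and cost units below are made up for illustration.

```python
# Per the post, assume a cached prompt token costs roughly 1/10th of an uncached one.
UNCACHED_COST = 1.0
CACHED_COST = 0.1


def prefill_cost(hit_rate, prompt_tokens=10_000):
    """Total prefill cost when `hit_rate` of the prompt's tokens are already cached."""
    cached = prompt_tokens * hit_rate
    return cached * CACHED_COST + (prompt_tokens - cached) * UNCACHED_COST


baseline = prefill_cost(0.0)
for hit_rate in (0.0, 0.5, 0.9):
    print(f"hit rate {hit_rate:.0%}: {prefill_cost(hit_rate) / baseline:.2f}x relative prefill cost")
# 0% -> 1.00x, 50% -> 0.55x, 90% -> 0.19x: moving from a cache-blind balancer's low hit rate
# to ~90% cuts prefill cost by roughly 5x under these assumptions.
```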
Your distributed vLLM setup is likely wasting GPU cycles. The culprit? Your load balancer.

Standard round-robin is blind to KV-cache state, leading to cache misses that force costly re-computation of tokens.

Our latest post on the llm-d blog dives deep into this problem. 🧵
January 12, 2026 at 3:15 PM
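A toy simulation of the failure mode described above. Cache-blind balancing is modeled as random assignment, which is roughly what interleaved round-robin traffic looks like from any single chat session's point of view; the replica counts and session shape are invented, and the sticky policy stands in for cache-aware scheduling, not llm-d's actual scorer.

```python
import random

REPLICAS = 4
SESSIONS = 100
TURNS = 8  # follow-up requests per chat session
random.seed(0)


def simulate(cache_aware):
    """Fraction of follow-ups that reach the replica holding the session's up-to-date KV prefix.

    Only the replica that served the previous turn has the full conversation prefix,
    because the prefix grows with every turn."""
    hits = total = 0
    for _ in range(SESSIONS):
        last = None  # replica holding this session's latest prefix
        for turn in range(TURNS):
            if cache_aware and last is not None:
                target = last                        # route the follow-up back to the warm replica
            else:
                target = random.randrange(REPLICAS)  # cache-blind balancing
            if turn > 0:
                total += 1
                hits += int(target == last)
            last = target
    return hits / total


print(f"cache-blind prefix hit rate:  {simulate(cache_aware=False):.0%}")  # ~25% with 4 replicas
print(f"cache-aware prefix hit rate: {simulate(cache_aware=True):.0%}")    # 100% after the first turn
```

Every miss in the cache-blind case is exactly the costly re-computation the post describes.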
The llm-d community is crushing it! 🚀 We're constantly seeing amazing knowledge sharing, and our new Events page is the perfect spot to find it all.

Case in point: an upcoming talk from Jeff Fan at @DigitalOcean diving into next-gen AI infrastructure with llm-d! 👇
January 12, 2026 at 3:15 PM
The llm-d community is building incredible things! 🚀 Shout-out to Ernest Wong & Sachi Desai from Microsoft for their new blog post pairing llm-d with Retrieval-Augmented Generation (RAG) on Azure Kubernetes Service (AKS)! This is a must-read guide! 👇 https://t.co/DPfRUdTLJB
January 12, 2026 at 3:15 PM