Lightnews — Scholar-powered news

@suraj.io

Come see us (me & Yuhan Liu) tomorrow for our talk.

Specifically, Wednesday November 12, 2025 5:30pm - 6:00pm EST at Building B | Level 5 | Thomas Murphy Ballroom 1.

More info: sched.co/27FcQ #kubecon #vllm

KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...

View more about this event at KubeCon + CloudNativeCon North America 2025

sched.co

November 11, 2025 at 7:52 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Announcing Ray Direct Transport: RDMA Support in Ray Core
www.anyscale.com/blog/ray-dir...

Ray Direct Transport: RDMA Support in Ray Core (Part 1)

Ray Direct Transport enables fast and direct GPU transfers in Ray via RDMA-backed transports. Using RDT, we can achieve up to 1000x faster GPU-GPU transfers than Ray’s native object store with a few l...

www.anyscale.com

November 5, 2025 at 1:06 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Building a tool to copy-paste share terminal sessions using Claude Code for web
open.substack.com/pub/simonw/p...

Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas

open.substack.com

October 24, 2025 at 8:07 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
arxiv.org/abs/2510.09665

Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...

arxiv.org

October 18, 2025 at 10:34 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog developer.nvidia.com/blog/underst...

If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware…

developer.nvidia.com

October 17, 2025 at 8:12 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Join me and Yuhan Liu for our talk at the upcoming #Kubecon NA 2025 in Atlanta: sched.co/27FcQ. We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!

KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...

View more about this event at KubeCon + CloudNativeCon North America 2025

sched.co

October 15, 2025 at 10:29 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Using Claude Code, but with GitHub Copilot-hosted Claude models:
github.com/surajssd/dot...

TFS @nilekh.bsky.social

github.com

October 14, 2025 at 10:06 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog developer.nvidia.com/blog/nvidia-...

SemiAnalysis recently launched InferenceMAX v1, a new open source initiative that provides a comprehensive methodology to evaluate inference hardware performance. Published results demonstrate that…

developer.nvidia.com

October 14, 2025 at 6:39 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Claude Code: Tips and Tricks

youtu.be/HSkLeECsBcw?...

YouTube video by Anand Tyagi

youtu.be

October 13, 2025 at 10:54 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Gang Scheduling for Llama by Anca Agape and Andre Darabanov
www.youtube.com/watch?v=4Bef...

YouTube video by @Scale

www.youtube.com

October 1, 2025 at 5:15 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog developer.nvidia.com/blog/how-to-...

#LMCache

As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…

developer.nvidia.com

October 1, 2025 at 4:33 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure

www.infoq.com/articles/llm...

Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving...

www.infoq.com

October 1, 2025 at 3:48 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog developer.nvidia.com/blog/cut-mod...

Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…

developer.nvidia.com

September 29, 2025 at 4:58 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

The Only Trait for Success in the AI Era—How to Build It youtu.be/xWYb7tImErI?...

The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh

YouTube video by EO

youtu.be

September 3, 2025 at 3:18 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving youtu.be/WwJvecXOeUA?...

OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...

YouTube video by USENIX

youtu.be

August 28, 2025 at 8:09 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve youtu.be/S8rq3pYboZY?...

YouTube video by USENIX

youtu.be

August 28, 2025 at 7:47 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation youtu.be/YqIHESG0suI?...

More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen

YouTube video by CNCF [Cloud Native Computing Foundation]

youtu.be

August 28, 2025 at 7:28 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

Extending Kubernetes for AI | Lessons Learned From Platform Engineering
youtu.be/d9K5PSsHtDg?...

Extending Kubernetes for AI | Lessons Learned From Platform... - Susan, Lucy, Andrea, Etienne, Tim

YouTube video by CNCF [Cloud Native Computing Foundation]

youtu.be

August 28, 2025 at 7:26 AM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

You Need to Be Bored. Here's Why.
www.youtube.com/watch?v=orQK...

YouTube video by Harvard Business Review

www.youtube.com

August 27, 2025 at 1:53 PM

Suraj Deshmukh | सुरज देशमुख

@suraj.io

You can use ChatGPT and other models on a flight via WhatsApp, using the free onboard WiFi.

Use MetaAI out of the box or save these contacts:

- ChatGPT +1 (800) 242-8478
- Microsoft Copilot +1 (877) 224-1042

August 27, 2025 at 12:51 PM