🎉 llm-d v0.7 is officially live! If our earlier releases proved what llm-d could do, v0.7 is about making sure you can easily deploy and run it in production. Backed by a 3.5x surge in community PR volume, this release focuses entirely on production hardening, eliminating operational friction, and expanding hardware reach.
Here is the quick technical breakdown of what’s new:
⚙️ Streamlined Day-1 Ops: Go from clone to serving in minutes using the new Standalone Mode (Envoy default), plus a complete migration to Kustomize-first deployment pipelines.
🔌 Blackwell & Multi-Hardware: Upgraded to CUDA 13 for native NVIDIA Blackwell support, alongside validated production images for AMD ROCm, Intel XPUs, Google TPUs, and Rebellions ATOM.
🧠 Workload-Aware Routing: Introduced experimental Flow Control to eliminate noisy-neighbor issues, and an OpenAI-compatible Batch Gateway for heavy offline workloads.
💾 Tiered KV Caching: Real-time prefix cache tracking is now enabled by default, paired with seamless cache offloading from GPU HBM to CPU and persistent storage (AWS EFS/NVMe).
We’ve also added 10,000+ lines of brand-new documentation and an overhauled, multi-platform CI matrix to ensure that what we document is exactly what you deploy.
A massive thank you to our 23 new contributors and hardware ecosystem partners for making this milestone happen.
Read the full architectural breakdown on our blog: 👇 https://lnkd.in/eHvUECVQ
#AIInfrastructure #LLMInference #OpenSource #Kubernetes #PlatformEngineering
llm-d
Software Development
Open source project providing distributed inferencing for Generative AI runtimes on any Kubernetes cluster.
About us
llm-d is a new open source project focused on providing distributed inferencing for Generative AI runtimes on any Kubernetes cluster. Its architecture is designed for high performance and scalability, aiming to reduce costs through a spectrum of hardware and software efficiency improvements. llm-d prioritizes ease of deployment and use, as well as the operational needs of running large GPU clusters, including SRE concerns and day 2 operations. At launch, its key features include prefill/decode disaggregation, KV cache distribution and management, an AI-aware router with customizable scoring, operational telemetry, Kubernetes-based deployment, and the NIXL inference optimized transfer library.
- Website
- https://llm-d.ai/
- Industry
- Software Development
- Company size
- 11-50 employees
- Type
- Nonprofit
- Founded
- 2025
Updates
-
llm-d reposted this
Kubernetes didn't win because it was the first container orchestrator. It won because it became the open standard everyone could build on. AI inference needs the same moment. LLM workloads are stateful, latency-sensitive, and wildly variable in cost. Standard service routing wasn't built for this. That gap is exactly what llm-d addresses. By contributing llm-d to the CNCF, Red Hat, alongside CoreWeave, IBM, Google, and NVIDIA, is making a long-term bet: that the future of enterprise AI runs on open standards, not proprietary lock-in. Read more: https://red.ht/4tK5n7d #OpenSource #CNCF #AIInference #CloudNative #RedHat #llmd #EnterpriseAI
-
-
Want to learn more about how Google is helping the open source community with their involvement in llm-d? Join us in Boston THIS WEEK for the llm-d meetup at their Cambridge, MA office. https://luma.com/eqbc1gxq Join now before registration closes later today.
-
llm-d reposted this
TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for llm-d. Great step by Google to start enabling the wider ML community for TPUs. TPU is catching up to NVIDIA for llm-d CI & code quality. In comparison, although AMD's officially recommended production Kubernetes inferencing solution is llm-d, Anush E. has yet to add any AMD GPUs or AMD NICs to the CI.
-
-
llm-d reposted this
Your Kubernetes load balancer has no idea what's happening inside your vLLM pods. That's why your P99 explodes. llm-d, now a CNCF Sandbox project, fixes this with cache-aware routing that actually understands inference state. I wrote a deep dive on what it does and when it's worth adopting. #Kubernetes #LLM #vLLM #AI #Infrastructure #CNCF #PlatformEngineering #MLOps
-
llm-d reposted this
I really enjoyed IBM Research’s Martin Hickey’s in-depth #OSSummit presentation on KV-cache management and scaling #AI inference using vLLM, llm-d, and Kubernetes (Official)! Priya Nagpurkar Jeffrey Welser Cara Delia Shuchi Sharma
-
-
Join the vLLM and llm-d maintainers next month in London!
Open source AI is thriving in the UK - and if you want to be part of this moment you need to join us on 10 June at Sustainable Ventures in County Hall, London! Whether you’re looking to squeeze every last token out of your GPU cluster or you're curious about the latest commits to the vLLM and llm-d ecosystems, this is the room you want to be in. Join Stuart Battersby, David Hughes, Ganesh K., Eldar Kurtić, Michael Goin, and Saša Zelenović to learn about vLLM and llm-d at this event sponsored by NVIDIA, Stelia AI, and Red Hat. Kanishka Narayan MP, Minister for AI and Online Safety, has expressed his ambition for the UK to become the global home of open source AI talent. We hope to catalyse the UK open source AI community to deliver on that goal! Event link in the comments… Tom Stockton James White Henry Irvine Luna Mustfa Tom B. Andrew Larkham PhD Nayyab Naqvi, PhD MBCS Jack Perschke Martha Dacombe Guy Ward Jackson Vinous Ali
-
-
We’re bringing the llm-d community to Boston Tech Week on May 28th, and we’ve got some serious heavy hitters joining us. With support from our friends at Google, we’ve put together a lineup of contributors and users who are actually building the future of LLMs, not just talking about it:
Tyler Michael Smith (Red Hat)
Sean Horgan (Google)
Peter Tanski (Capital One)
The agenda is evolving, but the seats are filling up fast. If you’re in the Boston area and want to dive deep into LLM development, you won’t want to miss this.
👉 Secure your spot here: https://luma.com/eqbc1gxq
-
llm-d reposted this
CoreWeave 🤝 Red Hat
For many enterprise AI teams, choosing between on-prem and cloud isn't an option. You have to run both. Today, CoreWeave and Red Hat published a deployment blueprint for hybrid inference: the same open-source stack, running consistently across your data center and CoreWeave. No proprietary lock-in. No new abstractions to learn. Production inference that works the way your team does.
See how in our latest blog. https://lnkd.in/g_gK8xwU
-
-
Moving a Large Language Model from a cool demo to a production-ready service is where the "real" work begins. In production, it’s no longer just about accuracy; it’s about latency consistency, GPU efficiency, and cost-to-serve. The llm-d community is excited to see a deep dive from Oracle Cloud Infrastructure (OCI) on how our Kubernetes-native framework is solving these exact challenges at scale.
The Highlights:
🔹 Disaggregated Prefill-Decode (PD): By using llm-d to separate the "prefill" (input) and "decode" (output) phases, teams can optimize for the unique hardware demands of each.
🔹 Performance Wins: Recent testing on AMD MI300X GPUs showed that a 16-GPU disaggregated setup can outperform a 32-GPU traditional setup. That’s 2x the stability for 50% of the cost.
🔹 Open Collaboration: This isn't just about benchmarks. We are proud of the collaboration between Red Hat, AMD, and the Oracle OCI teams to upstream this guidance directly into the llm-d repository.
Why this matters for the Open Source Community: llm-d was founded to provide a CNCF-aligned, vendor-neutral way to manage distributed inference. Seeing major providers like Oracle validate these patterns on high-performance bare metal hardware proves that the open-source ecosystem is leading the charge in production AI efficiency.
The path from "demo" to "enterprise-grade" is getting shorter. If you’re looking to scale your LLM inference without throwing unlimited hardware at the problem, it’s time to look at disaggregated serving.
Read more here: https://lnkd.in/djcSCiNN