Recently, Red Hat, Google, CoreWeave and IBM announced the llm-d project, a distributed inference serving framework for Kubernetes built on vLLM and kgateway. But what does that really mean in practice? And more importantly, do you actually need distributed inferencing?
Many of the organizations I work with are choosing to run inference workloads on their own infrastructure using open-source models and GPUs. The motivations vary: some need better performance, others have strict data privacy or compliance requirements, and of course, there are always cost considerations.
The key innovation behind llm-d is how it distributes inference. It splits the inference process into two distinct phases, prefill and decode, and runs each in separate workloads (pods). The project calls this approach "disaggregated serving."
- Prefill takes the user’s prompt, processes the tokens into vectors, and stores them in a key-value (KV) cache. This stage is extremely compute-heavy, but also highly parallelizable.
- Once the cache is built, decode kicks in. It reads from the cache and generates the response tokens. This is a lighter process, but it’s latency-sensitive and benefits from fast access to prefilled data.
When large prompts share common prefixes, like in multi-turn chats, RAG-based interactions, or agentic workflows with long system prompts, the same cache can be reused across requests. That means faster responses and lower compute requirements. And this reuse happens more often than you'd think. Consider a chat with an assistant: each turn in the conversation often includes the full history or a consistent system prompt. With agentic systems, the context or memory passed to the LLM stays fairly stable between turns. If you're recomputing all of that from scratch every time, you're burning unnecessary GPU cycles.
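To make prefix reuse concrete, here is a minimal sketch, not llm-d code, of how a cache-aware router might measure how much of an incoming prompt is already covered by a previously cached prefix. The token lists and the helper name are hypothetical.

# Illustrative only: how much prefill work does a shared prefix save?
def shared_prefix_length(cached_tokens, prompt_tokens):
    """Count the leading tokens the new prompt shares with a cached entry."""
    count = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        count += 1
    return count

# Example: a multi-turn chat resends the same system prompt every turn.
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
turn_1 = system_prompt + ["What", "is", "vLLM", "?"]
turn_2 = system_prompt + ["How", "does", "llm-d", "route", "requests", "?"]

reused = shared_prefix_length(turn_1, turn_2)
print(f"{reused} of {len(turn_2)} tokens can be served from the KV cache")

The longer and more stable the shared prefix (system prompts, conversation history, agent memory), the more prefill compute you avoid.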
That's where llm-d shines. It manages this distributed inference, caching, indexing, and routing to achieve faster inference, more efficient use of GPU resources, and lower overall costs. It uses runtime metrics from the individual inference workloads (such as KV cache utilization, work queue depth, and knowledge of which prefixes are cached on which workload) to implement smart LLM endpoint routing.
Let's take a look at concrete examples to understand how llm-d works.
Meet Our Inference Workload Cluster
Let's start with an example cluster setup. Note that we have specific "prefill" and "decode" workloads, as well as a workload that can handle "both" phases. As we'll see in the walkthrough, there are scenarios where we use disaggregated/distributed inferencing and others where we don't need to, because a single workload can handle both phases. In a real environment, you'd have some ratio of prefill/decode/both workloads.
Pods:
  - name: "vllm-prefill-1"   # Specialized for prompt processing
    role: "prefill"
    metrics: {waitingQueue: 2, kvCacheUsage: 30%}
  - name: "vllm-prefill-2"   # Another prefill specialist
    role: "prefill"
    metrics: {waitingQueue: 0, kvCacheUsage: 10%}
  - name: "vllm-decode-1"    # Optimized for token generation
    role: "decode"
    metrics: {waitingQueue: 1, kvCacheUsage: 60%}
  - name: "vllm-decode-2"    # Another decode specialist
    role: "decode"
    metrics: {waitingQueue: 0, kvCacheUsage: 40%}
  - name: "vllm-both-1"      # Flexible pod for both phases
    role: "both"
    metrics: {waitingQueue: 3, kvCacheUsage: 80%}

The llm-d scheduler implements a multi-stage scheduling pipeline (see the sketch after this list):

Stage 1: Filtering
- Role-based filtering: Prefill vs decode vs both
- Load-based filtering: Exclude overloaded pods
- Capacity filtering: Skip pods near memory limits
Stage 2: Scoring (Weighted)
- Load awareness: Queue length and resource usage
- Prefix matching: Cache hit probability
- Session affinity: Conversation continuity
- KV cache optimization: Memory reuse patterns
Stage 3: Selection
- Weighted combination: All scores combined with configurable weights
- Tie-breaking: Random selection among equals
- Fallback: Graceful degradation when no ideal pods exist
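To make the three stages concrete, here is a simplified sketch of the filter, score, and select flow. This is illustrative pseudologic rather than the actual llm-d scheduler; the pod fields, weights, and thresholds are assumptions for the example.

import random
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    role: str              # "prefill", "decode", or "both"
    waiting_queue: int     # requests queued on this vLLM instance
    kv_cache_usage: float  # 0.0 - 1.0
    prefix_hit: float      # estimated cache-hit probability for this prompt

def schedule(pods, needed_role, weights, max_queue=10, max_kv=0.95):
    # Stage 1 (filtering): role, load, and capacity checks.
    candidates = [p for p in pods
                  if p.role in (needed_role, "both")
                  and p.waiting_queue <= max_queue
                  and p.kv_cache_usage < max_kv]
    if not candidates:  # Fallback: degrade gracefully instead of failing outright
        candidates = [p for p in pods if p.role in (needed_role, "both")]

    # Stage 2 (scoring): weighted combination of load and cache signals.
    def score(p):
        load = 1.0 / (1 + p.waiting_queue)
        return (weights["load"] * load
                + weights["prefix"] * p.prefix_hit
                + weights["kv"] * (1.0 - p.kv_cache_usage))

    # Stage 3 (selection): best score wins, random tie-break among equals.
    best = max(score(p) for p in candidates)
    return random.choice([p for p in candidates if score(p) == best])

pods = [Pod("vllm-decode-1", "decode", 1, 0.60, 0.0),
        Pod("vllm-decode-2", "decode", 0, 0.40, 0.0),
        Pod("vllm-both-1",   "both",   3, 0.80, 0.0)]
print(schedule(pods, "decode", weights={"load": 1.0, "prefix": 2.0, "kv": 0.5}).name)
# -> vllm-decode-2, the least loaded decode-capable pod

The real scheduler exposes these weights and thresholds as configuration; the point is simply that each stage narrows or ranks the candidate set before a single endpoint is chosen.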
To see the following scenarios in a visual form, take a look at this video:
Example 1: The Simple Case - Smart Load Balancing
Request arrives:
{
  "model": "llama-7b",
  "prompt": "Hello, how are you?", // Short: ~5 tokens
  "max_tokens": 50
}
Scheduler's decision process:
- Length Analysis: 5 tokens < 100 token threshold → Single pod sufficient
- Filtering: Only decode-capable pods considered (decode-1, decode-2, both-1)
- Scoring: Load-aware algorithm evaluates queue lengths:
  - decode-1: 1 request waiting → score = 4.96
  - decode-2: 0 requests waiting → score = 5.0 ⭐
  - both-1: 3 requests waiting → score = 4.88
Result: Request routed to vllm-decode-2 (shortest queue)

In this case, we determined that the number of tokens in the prompt (5) did not meet the configurable threshold (100) that triggers disaggregated/distributed inference. The scheduler eliminated the prefill workloads and considered only the decode and both workloads. It scored those workloads according to their runtime characteristics (KV cache usage, load, and cache-hit probability, though that last metric is not shown above) and used the resulting scores to pick the destination workload endpoint.
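The scores above follow a simple pattern: a base score minus a small penalty per queued request. The snippet below reproduces those numbers; the base and penalty values are assumptions chosen to match this example, not llm-d's actual (configurable) scoring weights.

# Illustrative load-aware scorer that matches the numbers in Example 1.
def load_score(waiting_queue, base=5.0, penalty_per_request=0.04):
    return base - penalty_per_request * waiting_queue

queue_depths = {"decode-1": 1, "decode-2": 0, "both-1": 3}
for name, depth in queue_depths.items():
    print(name, round(load_score(depth), 2))   # decode-1: 4.96, decode-2: 5.0, both-1: 4.88

best = max(queue_depths, key=lambda name: load_score(queue_depths[name]))
print("routing to", best)                      # decode-2, the shortest queue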
In the next example, we look at how to trigger disaggregated/distributed inference.
NOTE: llm-d builds on top of the Gateway API Inference Extension (GAIE) to help filter and score workloads. kgateway, a reference implementation of GAIE, is used in the llm-d project.
Example 2: The Complex Case - Disaggregated Processing
Request arrives:
{
  "model": "llama-7b",
  "prompt": "You are a helpful AI assistant specialized in code review. Please analyze the following Python function and suggest improvements: def calculate_factorial(n): ...", // Long: ~150 tokens
  "max_tokens": 200
}
The scheduler detects that this prompt is long enough to trigger disaggregated prefill/decode processing.
Decision process:
- Length Analysis: 150 tokens > 100 → Disaggregated mode activated
- Decode Pod Selection:
  - Discovers vllm-decode-2 has cached similar "code review assistant" prefixes (40% match)
  - Winner: decode-2 (load score: 5.0 + prefix bonus: 2.0 = 7.0)
- Prefill Pod Selection:
  - Finds vllm-prefill-1 has a 60% prefix match for "code review" patterns
  - Winner: prefill-1 (despite higher load, the prefix cache wins)
Result:
- Prefill: vllm-prefill-1 processes the prompt
- Decode: vllm-decode-2 generates tokens
- Header added: x-prefiller-url: http://10.0.1.10:8000

In this scenario, llm-d finds that the prompt is longer than the configured threshold, which triggers "PD" (prefill/decode disaggregation).
Under the covers, llm-d uses a smarter algorithm than simply counting the tokens in the prompt. It takes into account how much of the prompt is already cached and estimates how much actual work would be needed to compute new vectors for the prompt, weighting that estimate by the prompt's size.
In this case, the selection logic selects both a decode workload AND a prefill workload. It then routes the request to the decode workload with an "x-prefiller-url" header pointing to the prefill worker. The decode worker sends the prompt to the prefill worker (based on x-prefiller-url) to compute the prompt's token vectors, and the prefill worker then hands back to the decode workload, which finishes generating the tokens to be returned.
You may ask: does all of this coordination across multiple workloads for a single prompt end up being less efficient? The answer is, it depends on the nature of the prompt. That's why there is a configurable threshold (PD_PROMPT_LEN_THRESHOLD) that triggers disaggregation; it can be tuned to get the best results for your workloads. Above some prompt length, distributing the inference across multiple workloads wins out.
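Putting those pieces together, the decision looks roughly like the sketch below: estimate how many prompt tokens still need prefill work (that is, are not already covered by a cached prefix) and only pay the coordination cost when that estimate crosses the threshold. Only the PD_PROMPT_LEN_THRESHOLD knob comes from the project; the helper and the simple subtraction are illustrative assumptions.

# Illustrative decision logic for prefill/decode disaggregation.
PD_PROMPT_LEN_THRESHOLD = 100  # configurable threshold discussed above

def should_disaggregate(prompt_tokens: int, cached_prefix_tokens: int = 0) -> bool:
    # Only tokens not already covered by a KV-cache prefix need prefill work.
    # (The real scheduler also weights this estimate by overall prompt size.)
    effective_prefill_tokens = max(prompt_tokens - cached_prefix_tokens, 0)
    return effective_prefill_tokens > PD_PROMPT_LEN_THRESHOLD

print(should_disaggregate(5))    # Example 1: False -> a single pod handles both phases
print(should_disaggregate(150))  # Example 2: True  -> disaggregated prefill/decode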
Example 3: Crisis Management - High Load Scenarios
System state: All pods under heavy load
vllm-decode-1: {waitingQueue: 15, kvCache: 0.9}
vllm-decode-2: {waitingQueue: 8, kvCache: 0.7}
vllm-both-1: {waitingQueue: 20, kvCache: 0.95}
vllm-prefill-1: {waitingQueue: 10, kvCache: 0.8}
vllm-prefill-2: {waitingQueue: 5, kvCache: 0.6}
Request arrives:
{
  "model": "llama-7b",
  "prompt": "Translate this text to French: The quick brown fox...",
  "criticality": "Sheddable" // Lower priority
}
Scheduler's crisis response:
- Queue Filtering: Automatically filters out pods with 10 or more waiting requests
  - prefill-1: 10 requests → ❌ Filtered out
  - prefill-2: 5 requests → ✅ Passes
  - decode-1: 15 requests → ❌ Filtered out
  - decode-2: 8 requests → ✅ Passes
  - both-1: 20 requests → ❌ Filtered out
- Graceful Degradation: Only decode-2 remains viable
Result: The system maintains service even under extreme load by steering requests away from overloaded pods and shedding lower-priority requests when no capacity remains.

In this case, we've still managed to find a workload that can handle the prompt. But we could also have discarded this prompt, since it's marked "Sheddable." That happens when the inference gateway decides a prompt comes from a type of client (e.g., some offline batch process) whose requests can be dropped when the system is under critical pressure.
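A rough sketch of that behavior, assuming a queue-depth cutoff of 10 and the "Sheddable" criticality class from the example (the field names and return convention are illustrative):

# Illustrative saturation handling: filter overloaded pods and shed
# low-priority requests when nothing viable remains.
MAX_WAITING = 10

def pick_pod_or_shed(pods, criticality):
    viable = [p for p in pods
              if p["waitingQueue"] < MAX_WAITING and p["role"] in ("decode", "both")]
    if not viable:
        if criticality == "Sheddable":
            return None                   # Drop the request (e.g., respond 429/503)
        viable = [p for p in pods if p["role"] in ("decode", "both")]  # Critical traffic still lands somewhere
    return min(viable, key=lambda p: p["waitingQueue"])

pods = [
    {"name": "vllm-decode-1", "role": "decode", "waitingQueue": 15},
    {"name": "vllm-decode-2", "role": "decode", "waitingQueue": 8},
    {"name": "vllm-both-1",   "role": "both",   "waitingQueue": 20},
]
print(pick_pod_or_shed(pods, "Sheddable")["name"])  # vllm-decode-2 still has headroom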
Example 4: The Conversation Continues - Session Affinity
Request with session context:
{
  "model": "llama-7b",
  "prompt": "Continue our previous conversation about machine learning...",
  "headers": {
    "x-session-token": "dmxsbS1kZWNvZGUtMQ==" // Points to decode-1
  }
}
Scheduler's session-aware decision:
- Session Detection: Decodes token → Points to vllm-decode-1
- Affinity Override: Session scorer gives decode-1 maximum score (20.0)
- Final Decision: decode-1 wins despite having a queue
Result: Conversation continuity maintained
In this scenario, the client includes a previously computed "session token" with the prompt. On an earlier call, after a prefill/decode action has occurred, the inference gateway can return an "x-session-token" (which identifies the decode workload that serviced the prompt), and the client app/agent can include this session identifier in subsequent requests. This ensures that follow-up prompts are routed to the same decode worker, which already holds the KV cache from the previously computed prompts. It simplifies routing and can improve performance.
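The session token in this example is simply the target pod name, base64-encoded. Below is a small sketch of how a gateway-side scorer could honor it; the 20.0 boost mirrors the maximum affinity score in the example, and the helper function is illustrative.

import base64

# The token from the request above decodes to the pod holding this session's KV cache.
token = "dmxsbS1kZWNvZGUtMQ=="
preferred_pod = base64.b64decode(token).decode()       # -> "vllm-decode-1"

def session_affinity_score(pod_name, preferred_pod, boost=20.0):
    # A large affinity bonus outweighs ordinary load-based scores, so the
    # conversation sticks to the pod that already has its cached context.
    return boost if pod_name == preferred_pod else 0.0

for pod in ["vllm-decode-1", "vllm-decode-2", "vllm-both-1"]:
    print(pod, session_affinity_score(pod, preferred_pod))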
The Future of LLM Infrastructure
As enterprises increasingly turn to self-hosted LLM infrastructure for reasons ranging from cost control to compliance, the need for smarter, more efficient inference routing becomes critical. The llm-d project offers a Kubernetes-native way to scale inference workloads intelligently by leveraging disaggregated processing, cache-aware scheduling, and runtime load metrics. Whether you're building agentic systems with long, reusable context or handling bursty traffic under tight latency budgets, llm-d's approach helps strike a practical balance between performance and resource efficiency. It’s a compelling direction for the future of LLM infrastructure, one that puts you in control of how and where inference happens.