Recently, Red Hat, Google, CoreWeave and IBM announced the llm-d project, a distributed inference serving framework for Kubernetes built on vLLM and kgateway. But what does that really mean in practice? And more importantly, do you actually need distributed inferencing?
Many of the organizations I work with are choosing to run inference workloads on their own infrastructure using open-source models and GPUs. The motivations vary: some need better performance, others have strict data privacy or compliance requirements, and of course, there are always cost considerations.
The key innovation behind llm-d is how it distributes inference. It splits the inference process into two distinct phases, prefill and decode, and runs each in separate workloads (pods). The project calls this approach "disaggregated serving."
- Prefill takes the user’s prompt, processes the tokens into vectors, and stores them in a key-value (KV) cache. This stage is extremely compute-heavy, but also highly parallelizable.
- Once the cache is built, decode kicks in. It reads from the cache and generates the response tokens. This is a lighter process, but it’s latency-sensitive and benefits from fast access to prefilled data.
When large prompts share common prefixes, like in multi-turn chats, RAG-based interactions, or agentic workflows with long system prompts, the same cache can be reused across requests. That means faster responses and lower compute requirements. And this reuse happens more often than you'd think. Consider a chat with an assistant: each turn in the conversation often includes the full history or a consistent system prompt. With agentic systems, the context or memory passed to the LLM stays fairly stable between turns. If you're recomputing all of that from scratch every time, you're burning unnecessary GPU cycles.
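To make prefix reuse concrete, here is a minimal sketch, not llm-d code, of how a cache-aware router might measure how much of an incoming prompt is already covered by a previously cached prefix. The token lists and the helper name are hypothetical.

# Illustrative only: how much prefill work does a shared prefix save?
def shared_prefix_length(cached_tokens, prompt_tokens):
    """Count the leading tokens the new prompt shares with a cached entry."""
    count = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        count += 1
    return count

# Example: a multi-turn chat resends the same system prompt every turn.
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
turn_1 = system_prompt + ["What", "is", "vLLM", "?"]
turn_2 = system_prompt + ["How", "does", "llm-d", "route", "requests", "?"]

reused = shared_prefix_length(turn_1, turn_2)
print(f"{reused} of {len(turn_2)} tokens can be served from the KV cache")

The longer and more stable the shared prefix (system prompts, conversation history, agent memory), the more prefill compute you avoid.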
That's where llm-d shines. It manages this distributed inference, caching, indexing, and routing to achieve faster inference, more efficient use of GPU resources, and lower overall costs. It uses runtime metrics from the individual inference workloads (such as KV cache utilization, work queue depth, and knowledge of which prefixes are cached on which workload) to implement smart LLM endpoint routing.
Let's take a look at concrete examples to understand how llm-d works.
Meet Our Inference Workload Cluster
Let's start with an example cluster setup. Note that we have specific "prefill" and "decode" workloads, as well as a workload that can handle "both" phases. As we'll see in the walkthrough, there are scenarios where we use disaggregated/distributed inferencing and others where we don't need to, because a single workload can handle both phases. In a real environment, you'd have some ratio of prefill/decode/both workloads.
Pods:
  - name: "vllm-prefill-1"   # Specialized for prompt processing
    role: "prefill"
    metrics: {waitingQueue: 2, kvCacheUsage: 30%}
  - name: "vllm-prefill-2"   # Another prefill specialist
    role: "prefill"
    metrics: {waitingQueue: 0, kvCacheUsage: 10%}
  - name: "vllm-decode-1"    # Optimized for token generation
    role: "decode"
    metrics: {waitingQueue: 1, kvCacheUsage: 60%}
  - name: "vllm-decode-2"    # Another decode specialist
    role: "decode"
    metrics: {waitingQueue: 0, kvCacheUsage: 40%}
  - name: "vllm-both-1"      # Flexible pod for both phases
    role: "both"
    metrics: {waitingQueue: 3, kvCacheUsage: 80%}

The llm-d scheduler implements a multi-stage scheduling pipeline (see the sketch after this list):

Stage 1: Filtering
- Role-based filtering: Prefill vs decode vs both
- Load-based filtering: Exclude overloaded pods
- Capacity filtering: Skip pods near memory limits
Stage 2: Scoring (Weighted)
- Load awareness: Queue length and resource usage
- Prefix matching: Cache hit probability
- Session affinity: Conversation continuity
- KV cache optimization: Memory reuse patterns
Stage 3: Selection
- Weighted combination: All scores combined with configurable weights
- Tie-breaking: Random selection among equals
- Fallback: Graceful degradation when no ideal pods exist
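To make the three stages concrete, here is a simplified sketch of the filter, score, and select flow. This is illustrative pseudologic rather than the actual llm-d scheduler; the pod fields, weights, and thresholds are assumptions for the example.

import random
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    role: str              # "prefill", "decode", or "both"
    waiting_queue: int     # requests queued on this vLLM instance
    kv_cache_usage: float  # 0.0 - 1.0
    prefix_hit: float      # estimated cache-hit probability for this prompt

def schedule(pods, needed_role, weights, max_queue=10, max_kv=0.95):
    # Stage 1 (filtering): role, load, and capacity checks.
    candidates = [p for p in pods
                  if p.role in (needed_role, "both")
                  and p.waiting_queue <= max_queue
                  and p.kv_cache_usage < max_kv]
    if not candidates:  # Fallback: degrade gracefully instead of failing outright
        candidates = [p for p in pods if p.role in (needed_role, "both")]

    # Stage 2 (scoring): weighted combination of load and cache signals.
    def score(p):
        load = 1.0 / (1 + p.waiting_queue)
        return (weights["load"] * load
                + weights["prefix"] * p.prefix_hit
                + weights["kv"] * (1.0 - p.kv_cache_usage))

    # Stage 3 (selection): best score wins, random tie-break among equals.
    best = max(score(p) for p in candidates)
    return random.choice([p for p in candidates if score(p) == best])

pods = [Pod("vllm-decode-1", "decode", 1, 0.60, 0.0),
        Pod("vllm-decode-2", "decode", 0, 0.40, 0.0),
        Pod("vllm-both-1",   "both",   3, 0.80, 0.0)]
print(schedule(pods, "decode", weights={"load": 1.0, "prefix": 2.0, "kv": 0.5}).name)
# -> vllm-decode-2, the least loaded decode-capable pod

The real scheduler exposes these weights and thresholds as configuration; the point is simply that each stage narrows or ranks the candidate set before a single endpoint is chosen.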
To see the following scenarios in a visual form, take a look at this video:
Example 1: The Simple Case - Smart Load Balancing
Request arrives:
{
  "model": "llama-7b",
  "prompt": "Hello, how are you?", // Short: ~5 tokens
  "max_tokens": 50
}
Scheduler's decision process:
- Length Analysis: 5 tokens < 100 token threshold → Single pod sufficient
- Filtering: Only decode-capable pods considered (decode-1, decode-2, both-1)
- Scoring: Load-aware algorithm evaluates queue lengths:
  - decode-1: 1 request waiting → score = 4.96
  - decode-2: 0 requests waiting → score = 5.0 ⭐
  - both-1: 3 requests waiting → score = 4.88
Result: Request routed to vllm-decode-2 (shortest queue)

In this case, we determined that the number of tokens in the prompt (5) did not meet the configurable threshold (100) that triggers disaggregated/distributed inference. The scheduler eliminated the prefill workloads and considered only the decode and both workloads. It scored those workloads according to their runtime characteristics (KV cache usage, load, and cache-hit probability, though that last metric is not shown above) and used the resulting scores to pick the destination workload endpoint.
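The scores above follow a simple pattern: a base score minus a small penalty per queued request. The snippet below reproduces those numbers; the base and penalty values are assumptions chosen to match this example, not llm-d's actual (configurable) scoring weights.

# Illustrative load-aware scorer that matches the numbers in Example 1.
def load_score(waiting_queue, base=5.0, penalty_per_request=0.04):
    return base - penalty_per_request * waiting_queue

queue_depths = {"decode-1": 1, "decode-2": 0, "both-1": 3}
for name, depth in queue_depths.items():
    print(name, round(load_score(depth), 2))   # decode-1: 4.96, decode-2: 5.0, both-1: 4.88

best = max(queue_depths, key=lambda name: load_score(queue_depths[name]))
print("routing to", best)                      # decode-2, the shortest queue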
In the next example, we look at how to trigger disaggregated/distributed inference.
NOTE: llm-d builds on top of the Gateway API Inference Extension (GAIE) to help filter and score workloads. kgateway, a reference implementation of GAIE, is used in the llm-d project.
Example 2: The Complex Case - Disaggregated Processing
Request arrives:
{
  "model": "llama-7b",
  "prompt": "You are a helpful AI assistant specialized in code review. Please analyze the following Python function and suggest improvements: def calculate_factorial(n): ...", // Long: ~150 tokens
  "max_tokens": 200
}
The scheduler detects that this prompt is long enough to trigger disaggregated prefill/decode processing.
Decision process:
- Length Analysis: 150 tokens > 100 → Disaggregated mode activated
- Decode Pod Selection:
  - Discovers vllm-decode-2 has cached similar "code review assistant" prefixes (40% match)
  - Winner: decode-2 (load score: 5.0 + prefix bonus: 2.0 = 7.0)
- Prefill Pod Selection:
  - Finds vllm-prefill-1 has a 60% prefix match for "code review" patterns
  - Winner: prefill-1 (despite higher load, the prefix cache wins)
Result:
- Prefill: vllm-prefill-1 processes the prompt
- Decode: vllm-decode-2 generates tokens
- Header added: x-prefiller-url: http://10.0.1.10:8000

In this scenario, llm-d finds that the prompt is longer than the configured threshold, which triggers "PD" (prefill/decode disaggregation).
Under the covers, llm-d uses a smarter algorithm than simply counting the tokens in the prompt. It takes into account how much of the prompt is already cached and estimates how much actual work would be needed to compute new vectors for the prompt, weighting that estimate by the prompt's size.
In this case, the selection logic selects both a decode workload AND a prefill workload. It then routes the request to the decode workload with an "x-prefiller-url" header pointing to the prefill worker. The decode worker sends the prompt to the prefill worker (based on x-prefiller-url) to compute the prompt's token vectors, and the prefill worker then hands back to the decode workload, which finishes generating the tokens to be returned.
You may ask: does all of this coordination across multiple workloads for a single prompt end up being less efficient? The answer is, it depends on the nature of the prompt. That's why there is a configurable threshold (PD_PROMPT_LEN_THRESHOLD) that triggers disaggregation; it can be tuned to get the best results for your workloads. Above some prompt length, distributing the inference across multiple workloads wins out.
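Putting those pieces together, the decision looks roughly like the sketch below: estimate how many prompt tokens still need prefill work (that is, are not already covered by a cached prefix) and only pay the coordination cost when that estimate crosses the threshold. Only the PD_PROMPT_LEN_THRESHOLD knob comes from the project; the helper and the simple subtraction are illustrative assumptions.

# Illustrative decision logic for prefill/decode disaggregation.
PD_PROMPT_LEN_THRESHOLD = 100  # configurable threshold discussed above

def should_disaggregate(prompt_tokens: int, cached_prefix_tokens: int = 0) -> bool:
    # Only tokens not already covered by a KV-cache prefix need prefill work.
    # (The real scheduler also weights this estimate by overall prompt size.)
    effective_prefill_tokens = max(prompt_tokens - cached_prefix_tokens, 0)
    return effective_prefill_tokens > PD_PROMPT_LEN_THRESHOLD

print(should_disaggregate(5))    # Example 1: False -> a single pod handles both phases
print(should_disaggregate(150))  # Example 2: True  -> disaggregated prefill/decode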
Example 3: Crisis Management - High Load Scenarios
System state: All pods under heavy load
vllm-decode-1: {waitingQueue: 15, kvCache: 0.9}
vllm-decode-2: {waitingQueue: 8, kvCache: 0.7}
vllm-both-1: {waitingQueue: 20, kvCache: 0.95}
vllm-prefill-1: {waitingQueue: 10, kvCache: 0.8}
vllm-prefill-2: {waitingQueue: 5, kvCache: 0.6}
Request arrives:
{
  "model": "llama-7b",
  "prompt": "Translate this text to French: The quick brown fox...",
  "criticality": "Sheddable" // Lower priority
}
Scheduler's crisis response:
- Queue Filtering: Automatically filters out pods with 10 or more waiting requests
  - prefill-1: 10 requests → ❌ Filtered out
  - prefill-2: 5 requests → ✅ Passes
  - decode-1: 15 requests → ❌ Filtered out
  - decode-2: 8 requests → ✅ Passes
  - both-1: 20 requests → ❌ Filtered out
- Graceful Degradation: Only decode-2 remains viable
Result: The system maintains service even under extreme load by steering requests away from overloaded pods and shedding lower-priority requests when no capacity remains.

In this case, we've still managed to find a workload that can handle the prompt. But we could also have discarded this prompt, since it's marked "Sheddable." That happens when the inference gateway decides a prompt comes from a type of client (e.g., some offline batch process) whose requests can be dropped when the system is under critical pressure.
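A rough sketch of that behavior, assuming a queue-depth cutoff of 10 and the "Sheddable" criticality class from the example (the field names and return convention are illustrative):

# Illustrative saturation handling: filter overloaded pods and shed
# low-priority requests when nothing viable remains.
MAX_WAITING = 10

def pick_pod_or_shed(pods, criticality):
    viable = [p for p in pods
              if p["waitingQueue"] < MAX_WAITING and p["role"] in ("decode", "both")]
    if not viable:
        if criticality == "Sheddable":
            return None                   # Drop the request (e.g., respond 429/503)
        viable = [p for p in pods if p["role"] in ("decode", "both")]  # Critical traffic still lands somewhere
    return min(viable, key=lambda p: p["waitingQueue"])

pods = [
    {"name": "vllm-decode-1", "role": "decode", "waitingQueue": 15},
    {"name": "vllm-decode-2", "role": "decode", "waitingQueue": 8},
    {"name": "vllm-both-1",   "role": "both",   "waitingQueue": 20},
]
print(pick_pod_or_shed(pods, "Sheddable")["name"])  # vllm-decode-2 still has headroom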
Example 4: The Conversation Continues - Session Affinity
Request with session context:
{
  "model": "llama-7b",
  "prompt": "Continue our previous conversation about machine learning...",
  "headers": {
    "x-session-token": "dmxsbS1kZWNvZGUtMQ==" // Points to decode-1
  }
}
Scheduler's session-aware decision:
- Session Detection: Decodes token → Points to vllm-decode-1
- Affinity Override: Session scorer gives decode-1 maximum score (20.0)
- Final Decision: decode-1 wins despite having a queue
Result: Conversation continuity maintained
In this scenario, the client includes a previously computed "session token" with the prompt. On an earlier call, after a prefill/decode action has occurred, the inference gateway can return an "x-session-token" (which identifies the decode workload that serviced the prompt), and the client app/agent can include this session identifier in subsequent requests. This ensures that follow-up prompts are routed to the same decode worker, which already holds the KV cache from the previously computed prompts. It simplifies routing and can improve performance.
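The session token in this example is simply the target pod name, base64-encoded. Below is a small sketch of how a gateway-side scorer could honor it; the 20.0 boost mirrors the maximum affinity score in the example, and the helper function is illustrative.

import base64

# The token from the request above decodes to the pod holding this session's KV cache.
token = "dmxsbS1kZWNvZGUtMQ=="
preferred_pod = base64.b64decode(token).decode()       # -> "vllm-decode-1"

def session_affinity_score(pod_name, preferred_pod, boost=20.0):
    # A large affinity bonus outweighs ordinary load-based scores, so the
    # conversation sticks to the pod that already has its cached context.
    return boost if pod_name == preferred_pod else 0.0

for pod in ["vllm-decode-1", "vllm-decode-2", "vllm-both-1"]:
    print(pod, session_affinity_score(pod, preferred_pod))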
The Future of LLM Infrastructure
As enterprises increasingly turn to self-hosted LLM infrastructure for reasons ranging from cost control to compliance, the need for smarter, more efficient inference routing becomes critical. The llm-d project offers a Kubernetes-native way to scale inference workloads intelligently by leveraging disaggregated processing, cache-aware scheduling, and runtime load metrics. Whether you're building agentic systems with long, reusable context or handling bursty traffic under tight latency budgets, llm-d's approach helps strike a practical balance between performance and resource efficiency. It’s a compelling direction for the future of LLM infrastructure, one that puts you in control of how and where inference happens.