Last updated: January 2026

Here is the number that reframes this paper: in agentic workloads, KV-cache hit rates can exceed 95%. The model barely needs to recompute anything. It mostly needs to load cached data, and that loading step may be where performance starts to break down.
A new paper from DeepSeek, Tsinghua, and Peking University makes this case with hard data and a concrete fix. The paper is called DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference, and the core claim is straightforward: for AI agents running long, multi-turn sessions, the bottleneck may shift away from GPU compute and toward storage I/O.
The Problem Nobody Talks About
Most LLM optimization work focuses on making models compute faster. Bigger GPUs, better kernels, more efficient attention mechanisms. That makes sense for single-turn inference where you process a prompt and generate a response.
But agents don’t work that way. A coding assistant might run 50 to 100 turns of interaction, calling tools, reading files, executing code. Each turn adds a few hundred tokens to a context that keeps growing. The good news: 95%+ of that context is already cached as KV-cache from previous turns. The bad news: loading that cache from storage can become the actual bottleneck.
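To see why hit rates climb so high, here is a rough back-of-the-envelope sketch in Python. The session shape (an 8,000-token starting context, 400 tokens appended per turn, 60 turns) is an assumption chosen for illustration, not a figure from the paper, and the sketch assumes every earlier token stays cached.

```python
def cache_hit_rate(cached_tokens: int, appended_tokens: int) -> float:
    """Fraction of a turn's context that is already in the KV-cache."""
    total = cached_tokens + appended_tokens
    return cached_tokens / total if total else 0.0


def simulate_session(initial_tokens: int, tokens_per_turn: int, turns: int) -> list[float]:
    """Per-turn hit rate, assuming nothing is ever evicted from the cache."""
    rates = []
    context = initial_tokens
    for _ in range(turns):
        rates.append(cache_hit_rate(context, tokens_per_turn))
        context += tokens_per_turn  # the new tokens are cached for the next turn
    return rates


# Illustrative numbers only (assumptions, not measurements).
rates = simulate_session(initial_tokens=8_000, tokens_per_turn=400, turns=60)
print(f"first turn: {rates[0]:.3f}, last turn: {rates[-1]:.3f}")
```

Under these assumptions the hit rate starts above 95% and keeps rising, because the cached prefix grows every turn while the appended slice stays the same size.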
The standard architecture for large-scale inference is Prefill-Decode disaggregation (PD separation). Prefill engines handle the heavy initial computation. Decode engines handle token-by-token generation. They’re separated for efficiency. But this creates a lopsided I/O problem:
- Prefill engines: storage NICs maxed out loading massive KV-caches
- Decode engines: storage NICs sitting mostly idle
- Result: one side is choking while the other side wastes bandwidth
GPU horsepower doesn’t matter when the data can’t get there fast enough.
DualPath: Use All the Bandwidth You Already Have
The fix is elegant in concept. Instead of loading KV-cache only through the prefill engine’s storage path, DualPath adds a second route: storage → decode engine → prefill engine over the RDMA network.
Two paths now exist:
- Storage → Prefill Engine (the traditional path)
- Storage → Decode Engine → Prefill Engine (the new path)
A global dynamic scheduler decides in real time which path to use based on current load. The decode engines, which previously sat around with idle storage bandwidth, now share the I/O burden.
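The idea of load-aware path selection can be sketched in a few lines of Python. This is not the paper's scheduler: it is a greedy "earliest finish time" heuristic, and the `Path` class, `pick_path` function, bandwidth values, and transfer sizes are all hypothetical names and numbers made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    bandwidth_gb_per_s: float  # usable bandwidth of this route (assumed value)
    queued_gb: float = 0.0     # data already scheduled on this route

    def eta_seconds(self, size_gb: float) -> float:
        """Estimated time until a new transfer of size_gb would finish."""
        return (self.queued_gb + size_gb) / self.bandwidth_gb_per_s


def pick_path(paths: list[Path], size_gb: float) -> Path:
    """Greedy choice: use the route that would finish this load soonest."""
    best = min(paths, key=lambda p: p.eta_seconds(size_gb))
    best.queued_gb += size_gb  # account for the newly scheduled transfer
    return best


# Two routes with equal (assumed) bandwidth and four 5 GB cache loads.
direct = Path("storage->prefill", bandwidth_gb_per_s=10.0)
via_decode = Path("storage->decode->prefill", bandwidth_gb_per_s=10.0)

chosen = [pick_path([direct, via_decode], size_gb=5.0).name for _ in range(4)]
print(chosen)
```

With equal bandwidth on both routes, the loads alternate between the two paths instead of piling up on the prefill side. A real scheduler would also have to weigh the extra RDMA hop and the decode engines' own traffic, which this sketch ignores.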
No data compression. No cache reduction. No new hardware. Just smarter routing of data that was already flowing through the system. The entire modification was about 5,000 lines of code on top of DeepSeek’s existing inference framework.
The Numbers
DeepSeek tested DualPath across three models: their own V3.2 660B (MoE architecture), a 27B scaled-down variant, and Qwen2.5-32B (dense GQA model). The results held across all of them:
- On the 660B model, DualPath sped up job completion by up to 1.87x over the baseline, approaching the theoretical “zero I/O overhead” limit
- Across different prefill-to-decode node ratios, average speedup was 1.64x, peaking at 2.46x
- When they deliberately increased per-turn compute (more appended tokens per turn), the baseline gradually caught up to DualPath, confirming the bottleneck is I/O, not compute
- SGLang + Mooncake couldn’t even complete some large-scale configurations that DualPath handled fine
The speedup pattern is the key point: when I/O dominates, DualPath helps a lot. When compute dominates, the advantage narrows. In the paper’s reported settings, the trade-off still looked favorable.
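A toy model makes that pattern easy to see. The Python sketch below assumes a turn costs cache-load time plus compute time with no overlap between them, and that the second path simply adds bandwidth. Both are simplifying assumptions, and the numbers are invented, so it will not reproduce the paper's measured speedups.

```python
def turn_time(kv_gb: float, bandwidth_gb_per_s: float, compute_s: float) -> float:
    """Toy cost model: load the cache, then compute (no overlap assumed)."""
    return kv_gb / bandwidth_gb_per_s + compute_s


def speedup(kv_gb: float, base_bw: float, extra_bw: float, compute_s: float) -> float:
    """Baseline time divided by the time with the extra path's bandwidth added."""
    baseline = turn_time(kv_gb, base_bw, compute_s)
    dual = turn_time(kv_gb, base_bw + extra_bw, compute_s)
    return baseline / dual


# Assumed values: 20 GB of cache per turn, 10 GB/s per path.
results = {
    compute_s: speedup(kv_gb=20.0, base_bw=10.0, extra_bw=10.0, compute_s=compute_s)
    for compute_s in (0.1, 1.0, 10.0)
}
for compute_s, gain in results.items():
    print(f"compute {compute_s:>4}s per turn -> speedup {gain:.2f}x")
```

In this model the gain approaches 2x when compute is negligible and shrinks toward 1x as per-turn compute grows, which mirrors the direction of the paper's result even though the magnitudes here are made up.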
Why This Matters for DeepSeek V4
This paper didn’t drop in isolation. DeepSeek V4 Lite (codename “sealion-lite”) is reportedly in heavy testing with a 1 million token context window and native multimodal support. At least one inference provider has access under strict NDA.
A million-token context means enormous KV-caches. Enormous KV-caches mean much heavier I/O pressure. DualPath looks like one plausible infrastructure response if DeepSeek really is pushing toward that scale.
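For a sense of scale, the standard KV-cache size estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element, per token. The Python sketch below plugs in a hypothetical GQA model shape and an assumed 50 GB/s of storage bandwidth. None of these values describe V4 or any model in the paper, and architectures that compress the cache (such as DeepSeek's MLA) would come out considerably smaller.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV-cache one token occupies: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical GQA shape in 16-bit precision (assumption for illustration).
per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128, bytes_per_elem=2)

context_tokens = 1_000_000
total_bytes = per_token * context_tokens

# Assumed aggregate storage bandwidth, not a measured figure.
bandwidth_bytes_per_s = 50e9
load_seconds = total_bytes / bandwidth_bytes_per_s

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_bytes / 1e9:.0f} GB for {context_tokens:,} tokens")
print(f"{load_seconds:.1f} s to load at {bandwidth_bytes_per_s / 1e9:.0f} GB/s")
```

Even with generous bandwidth, a cache in the hundreds of gigabytes takes seconds to load under these assumptions, which is why spreading that load across every available NIC becomes attractive.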
There’s also the hardware angle. Reports suggest DeepSeek has given domestic chip makers including Huawei early access to V4 for software adaptation, breaking from the usual pattern of prioritizing Nvidia optimization. When the bottleneck shifts from raw compute to data scheduling, the competitive dynamics between chip vendors change too. Bandwidth management and network architecture start mattering as much as FLOPS.
The Bigger Picture
The AI infrastructure conversation is still dominated by “how many GPUs” and “how many FLOPS.” DualPath is a reminder that systems engineering can matter just as much. The paper’s core insight, that many agentic workloads become I/O-bound before they become compute-bound, has implications beyond DeepSeek:
- Every company running AI agents at scale will hit this same wall
- The fix isn’t buying more hardware; it’s using existing hardware smarter
- As context windows grow (1M tokens and beyond), this problem only gets worse
DualPath is built on CUDA and still lives within the GPU ecosystem. But it points toward a future where the strongest inference stack may not just be the one with the fastest chips. It may be the one with the best data plumbing.