The Evolution of KV Cache

MHA burned memory. MQA burned quality. Here's what actually worked.

By Pradip Tivhale

Table of Contents

  1. The Transformer Revolution — Where It All Began
  2. Understanding Keys, Values, and Queries
  3. What Is KV Cache and Why Does It Exist?
  4. The Problem — When Memory Becomes the Bottleneck
  5. Multi-Head Attention (MHA) — The Baseline
  6. Multi-Query Attention (MQA) — The First Big Fix
  7. Grouped-Query Attention (GQA) — The Best of Both Worlds
  8. Multi-Head Latent Attention (MLA) — DeepSeek's Breakthrough
  9. Cross-Layer Attention (CLA) — Sharing Across Layers
  10. Sliding Window & Local-Global Attention
  11. PagedAttention — Virtual Memory for AI
  12. KV Cache Eviction — Keeping Only What Matters
  13. KV Cache Quantization — Shrinking Each Number
  14. FlashAttention — Making Hardware Work Smarter
  15. Model-by-Model Comparison — The Complete Picture
  16. Kimi K2.5 — How Reasoning Changes for AI Agents
  17. The State of the Art (2026) & Future Directions
  18. References
Chapter 01

The Transformer Revolution — Where It All Began

June 2017. Eight researchers at Google drop a paper with one of the boldest titles in computer science history: "Attention Is All You Need." Inside was a new architecture they called the Transformer — and it would go on to power virtually every major AI system built in the decade that followed.

To appreciate why the Transformer mattered, you need to understand what came before it. The reigning champion of text processing was the Recurrent Neural Network (RNN). Imagine a relay race where each runner must memorize and recite every message carried by all previous runners before passing the baton. Runner #50 has to recite 49 accumulated messages, then add their own. Inevitably, early messages get garbled or lost. Worse, the race is strictly sequential: runner #10 can't start until runner #9 finishes, no matter how many spare runners you have standing around. That's the RNN — slow, forgetful, and impossible to speed up by adding more hardware.

The Transformer discarded that relay race entirely. Its breakthrough? Attention. Instead of passing messages down a chain one at a time, it spreads the entire text on a giant table and lets every word look at every other word simultaneously. Think of a room of 500 people at a networking event where everyone can hear every conversation happening at once and instantly decide which ones are relevant to them. No chain. No waiting. Massively parallel.

Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017. arXiv: 1706.03762

The numbers spoke for themselves. On the WMT 2014 English-to-German translation benchmark, the Transformer hit a BLEU score of 28.4 — blowing past the previous best by over 2 points. In machine translation, that's not an incremental gain; it's a big jump. On top of that, training was much faster because the architecture could chew through all tokens in parallel rather than one at a time.

What followed was an explosion. BERT. GPT-2. GPT-3. GPT-4. LLaMA. Gemma. Qwen. DeepSeek. Mistral. Hundreds of models, all built on the same Transformer backbone. But as these models ballooned from millions to hundreds of billions of parameters — and as users demanded they handle longer and longer documents — a hidden cost quietly grew in the background. A cost that would eventually consume more GPU memory than the model weights themselves. That cost has a name: the KV Cache.

2017 — Transformer Architecture Published: Vaswani et al. introduce self-attention, replacing RNNs and CNNs for sequence modeling.

2018–2019 — BERT and GPT-2 Emerge: Transformers dominate NLP. Models grow from millions to billions of parameters.

2020–2022 — The Scaling Era: GPT-3 (175B parameters), PaLM (540B). KV cache memory becomes a serious concern.

2023–2024 — The Optimization Era: GQA, MLA, PagedAttention, KIVI — the industry fights the memory wall.

2025–2026 — The Efficiency Era: TurboQuant, Gemma 4, DeepSeek-V3 — models run on phones with 128K+ context.
Chapter 02

Understanding Keys, Values, and Queries

To understand the KV Cache, we first need to understand how attention works. The Transformer uses a mechanism called Scaled Dot-Product Attention, and it relies on three things: Queries (Q), Keys (K), and Values (V).

Here is an analogy that makes it concrete. Imagine you are ordering at a massive food court with 200 stalls:

Query (Q) — Your Craving: "I want something spicy with noodles." This is what the current word is looking for in context.

Key (K) — Menu Board Descriptions: "Sichuan stir-fried noodles" or "vanilla milkshake." Each stall advertises what it offers so you can judge relevance.

Value (V) — The Actual Dish Served: the real food you receive — the nutritional content that gets "consumed" by the model's computation.

In the Transformer, each word (or "token") generates three things: a Query, a Key, and a Value — all represented as lists of numbers (vectors). The process works like the food court: your craving (Query) gets compared against every stall's menu board (Keys) to produce a relevance score. High-scoring stalls contribute more of their dish (Value) to your final plate. Low-scoring ones contribute almost nothing. The result is a weighted blend of Values, customized to what this particular token was "looking for."

The Attention Formula

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) × V

In plain English: "Scan your craving (Q) against every stall's menu board (K), calculate a match percentage for each (softmax turns raw scores into percentages that add up to 100%), then blend the dishes (V) in proportion to how well they matched."

The √d_k part is just a scaling factor: the square root of the dimension of the key vectors. It keeps the raw scores from growing so large that softmax saturates.
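
The formula is compact enough to run. Below is a minimal NumPy sketch of scaled dot-product attention for a single head (toy sizes, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, for one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # craving vs. every menu board
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: scores -> percentages
    return weights @ V                                        # blend of dishes, weighted by match

# one query against five keys/values, head dim 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(Q, K, V)   # shape (1, 4)
```

The output is a convex combination of the V rows: each attention weight is between 0 and 1, and the weights sum to 1.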

How Big Are These Vectors?

Each Q, K, and V vector has a specific size called the head dimension (dh), typically 128 numbers. In a model like LLaMA-2 7B, there are 32 attention heads, each with dimension 128, giving a total hidden size of 4,096.

Why This Matters for Memory

During training, the model processes all tokens at once, so Q, K, and V are all computed together. But during inference (when the model generates text one token at a time), something important happens: the model generates one new token, but it needs to attend to ALL previous tokens. This means it needs the Keys and Values of every previous token — and that's where the KV Cache enters the picture.

Chapter 03

What Is KV Cache and Why Does It Exist?

Here's how a language model actually generates text. It doesn't write a whole sentence at once — it produces one token (roughly one word) at a time. Suppose the model has already written "The cat sat on the" and now needs to pick the next word. To make that decision, it has to run the attention mechanism, which means comparing the new token against every single previous token.

Now here's the wasteful part. Without caching, the model recomputes the Key and Value vectors for "The," "cat," "sat," "on," and "the" from scratch every single time it generates a new word. At token 1,000? It recomputes K and V for all 999 previous tokens. At token 1,001? All 1,000. You can see how this snowballs into an absurd amount of redundant work.

The KV Cache is the fix. It simply stores the Key and Value vectors once they're computed and reuses them for all future tokens. Each new token only needs to compute its own Q, K, V, then look up everything else from the cache. No redundant recomputation. Simple and effective.
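
A toy sketch of that loop, for a single layer and a single head (the class and function names are illustrative, not from any real library):

```python
import numpy as np

class KVCache:
    """Minimal single-layer, single-head KV cache sketch."""
    def __init__(self, head_dim):
        self.K = np.empty((0, head_dim))
        self.V = np.empty((0, head_dim))

    def append(self, k, v):
        # store this token's K and V once; reuse them for every future token
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_step(q, k, v, cache):
    cache.append(k, v)                       # cache grows by exactly one entry
    scores = q @ cache.K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache.V                       # attend over ALL cached tokens

cache = KVCache(head_dim=4)
rng = np.random.default_rng(1)
for t in range(5):                           # "generate" 5 tokens
    q, k, v = rng.normal(size=(3, 4))        # stand-ins for the new token's Q, K, V
    out = decode_step(q, k, v, cache)
# after 5 steps the cache holds 5 K rows and 5 V rows
```

Each step does one append and one lookup; nothing from earlier tokens is ever recomputed.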

Animation: Q × Kᵀ → Attention Scores → Weighted V → Output

Toy scaled-dot-product example: one 1×4 Query is compared against five cached Keys, scaled by √d_k, then turned into attention weights and a weighted V output.

The Speed Benefit

Without caching, step t must recompute the K and V projections for all t previous tokens before it can attend to them. That is O(t) redundant work per step, and O(n²) redundant work over a length-n generation. With the cache, each step computes K and V once for the single new token and reads everything else back, so the per-step cost drops to a single query attending over the cached entries.

The Memory Cost

Here's the catch: those cached K and V vectors consume GPU memory, and they grow linearly with sequence length. The formula for KV Cache memory in FP16 (16-bit floating point, 2 bytes per number) is:

KV Cache Memory Formula (FP16)

Memory = 2 × n_layers × n_kv_heads × d_head × seq_len × batch_size × 2 bytes

The first "2" is for Keys and Values. The final "2 bytes" is for FP16 precision. Let's see what this means for real models:

| Model | Attention | Layers | KV Heads | Head Dim | KV Cache / Token (FP16) | @ 4K Tokens | @ 128K Tokens |
|---|---|---|---|---|---|---|---|
| LLaMA-1 65B | MHA | 80 | 64 | 128 | 2,560 KB | 10 GB | 320 GB |
| Falcon-7B | MQA | 32 | 1 | 64 | 8 KB | 0.03 GB | 1.0 GB |
| Mistral 7B | GQA+SWA | 32 | 8 | 128 | 128 KB | 0.5 GB | 16 GB |
| Qwen3 235B | GQA-4 | 94 | 4 | 128 | 189 KB | 0.76 GB | 24 GB |
| DeepSeek-V3 | MLA | 61 | Latent (d_c=512) | 576* | 69 KB | 0.27 GB | 8.6 GB |
| Gemma 4 31B | GQA+LG | 60 | 16 (all layers) | 256 local / 512 global | ~1,120 KB naive | 1.1 GB | ~41 GB† |
†Gemma 4 31B: 60 layers (50 sliding, 10 global), all with 16 KV heads. Sliding layers use head_dim=256 and are capped at 1,024 tokens. Global layers use head_dim=512 with full context. Cache is sub-linear: ~1.1 GB at 1K but ~41 GB at 128K because global layers dominate at long contexts. Source: official config.json.
*MLA stores a compressed latent of 576 dims (dc=512 + dR=64) instead of separate K and V vectors. Formula: layers × 576 × 2 bytes.
KV cache per token calculations based on architectural parameters from respective model papers. DeepSeek-V3 KV cache figure (70 KB/token) from "DeepSeek-V3 Technical Report," arXiv: 2412.19437
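
The per-token column in the table above can be reproduced directly from the formula. A small sketch (the numbers match the LLaMA-1 65B and Falcon-7B rows):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per=2):
    """2 (K and V) × layers × kv_heads × head_dim × seq_len × batch × bytes (FP16 = 2)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per

# LLaMA-1 65B (MHA): 80 layers, 64 KV heads, head_dim 128
print(kv_cache_bytes(80, 64, 128, seq_len=1) / 1024)        # -> 2560.0 (KB per token)
print(kv_cache_bytes(80, 64, 128, 131072) / 2**30)          # -> 320.0 (GB at 128K)

# Falcon-7B (MQA): 32 layers, 1 KV head, head_dim 64
print(kv_cache_bytes(32, 1, 64, seq_len=1) / 1024)          # -> 8.0 (KB per token)
```

Swapping in any model's layer count, KV-head count, and head dimension reproduces its row.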

The Memory Wall

For LLaMA-1 65B (pure MHA) at 128K context, the KV cache requires 320 GB — while the model weights in FP16 are about 130 GB. The cache is 2.5x the model. Even for a 7B model with MHA, the cache hits 64 GB at 128K context while the model is only 14 GB (4.5x ratio). This is the "memory wall" that forced the industry to rethink attention.

Animation: KV Cache Growing Token by Token

Illustrative LLaMA-2 7B FP16-style baseline: each new token adds one more cached K/V state for every layer. Watch GPU memory fill up over time.

Chapter 04

The Problem — When Memory Becomes the Bottleneck

The KV Cache creates three critical problems that limit how we can deploy and use large language models:

1. Memory Wall — GPU memory fills up before the model can process long documents. A 7B model at 128K context needs 64 GB just for the cache.

2. Batch Size Limit — fewer requests can be served simultaneously. Each user's request needs its own KV cache, multiplying memory usage.

3. Bandwidth Bottleneck — reading large caches from GPU memory slows down generation. Memory bandwidth, not compute, becomes the limiting factor.

Why Memory Bandwidth Matters

An NVIDIA A100 GPU can perform 312 trillion floating-point operations per second (312 TFLOPS). That's a lot of raw math power. But it can only move data from its memory at about 2 terabytes per second. When a model generates text one token at a time, the actual math is trivial — multiply one query vector against a bunch of cached keys. The bottleneck isn't doing the multiplication; it's loading all those keys and values from memory fast enough. Engineers call this being memory-bandwidth bound.

Here's an analogy that makes this click. Imagine a pizza oven that can bake 1,000 pizzas per minute (the GPU's compute). Sounds amazing, right? Except the kitchen only has one narrow doorway for ingredients to come through (memory bandwidth). The oven sits idle most of the time, waiting for dough and toppings to arrive. Making the oven bigger doesn't help. You need either a wider doorway — or smaller pizzas.

The Memory Fragmentation Problem

There's another subtle issue. When serving multiple users simultaneously, each request may need a different amount of KV cache (because they have different conversation lengths). Traditional systems pre-allocate a fixed-size memory block for each request, leading to massive waste when the actual conversation is shorter than the maximum. The PagedAttention paper (see Chapter 11) showed that existing systems waste 60–80% of KV cache memory due to fragmentation.

Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv: 2309.06180

These problems inspired a decade of research that we'll explore in the following chapters. The solutions fall into several categories:

The Five Strategies for Taming KV Cache

1. Fewer KV heads: MQA, GQA, MLA — reduce how many separate Key-Value pairs you store
2. Smarter memory management: PagedAttention — eliminate waste in how memory is allocated
3. Smaller numbers: KIVI, KVQuant, TurboQuant — use fewer bits per number
4. Fewer tokens cached: H2O, SnapKV, StreamingLLM — don't cache everything
5. Shorter attention span: Sliding window, local-global — limit how far back the model looks

Chapter 05

Multi-Head Attention (MHA) — The Baseline

The original Transformer uses Multi-Head Attention (MHA). In MHA, the model runs multiple parallel "attention heads" — each one learns to focus on different types of word relationships. One head might specialize in tracking who did what to whom (subject-object). Another might track temporal cues ("before," "after," "meanwhile"). A third might notice negation patterns. They each develop their own specialty without being explicitly told what to learn.

In MHA, every attention head has its own separate set of Keys, Values, and Queries. For a model with 32 attention heads:

MHA KV Cache Size

KV Cache per token = 2 × n_layers × n_heads × d_head × 2 bytes

For LLaMA-2 7B: 2 × 32 layers × 32 heads × 128 dim × 2 bytes = 524,288 bytes = 512 KB per token.

MHA provides the best quality because each head maintains its own representation of Keys and Values. However, the KV cache grows proportionally to the number of heads, making it the most memory-expensive approach.

Multi-Head Attention (MHA): Each Head Has Its Own K and V (Two Separate Vectors)

32 Query Heads → 32 Key Heads + 32 Value Heads (1:1:1 mapping)

QUERIES: Q₁ Q₂ Q₃ … Q₃₂   KEYS: K₁ K₂ K₃ … K₃₂   VALUES: V₁ V₂ V₃ … V₃₂

32 K vectors + 32 V vectors, each 128 dims × 2 bytes = 16 KB per layer; × 32 layers (LLaMA-2 7B) = 512 KB/token. ▲ These K and V rows are CACHED (the KV Cache)

Important: K and V Are Two Separate Vectors

A common source of confusion: the "KV" in "KV Cache" does not mean a single combined vector. K (Key) and V (Value) are two completely independent vectors, each of size head_dim (typically 128 numbers). They are computed by separate weight matrices (WK and WV), stored separately in memory, and used at different stages of the attention computation. K is used to calculate relevance scores (multiplied with Q), while V provides the actual content that gets weighted and summed. The "2" in the KV cache formula 2 × layers × heads × dim accounts for these being two distinct stored vectors per head.

Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017. arXiv: 1706.03762
Chapter 06

Multi-Query Attention (MQA) — The First Big Fix

November 2019. Noam Shazeer — one of the eight original Transformer authors — publishes a short, almost casual paper with an idea so simple it seems like it shouldn't work: make all the attention heads share a single set of Keys and Values.

That's Multi-Query Attention (MQA) in one sentence. You still have 32 query heads asking 32 different questions about the text. But instead of each head maintaining its own personal reference sheet (its own Keys and Values), they all consult the exact same sheet. Imagine 32 food critics at a restaurant — each one evaluates the meal from a different angle (texture, aroma, plating, spice level), but they're all tasting from the same set of plates. One kitchen. Thirty-two opinions.

MQA KV Cache Size

KV Cache per token = 2 × n_layers × 1 × d_head × 2 bytes

For a model with 32 query heads but only 1 KV head, the KV cache shrinks by 32× compared to MHA!

Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv: 1911.02150, November 2019.

The Results

Shazeer demonstrated that since autoregressive decoding is bottlenecked by memory bandwidth (not compute), shrinking the KV cache directly speeds up token generation. In practice, models using MQA generated tokens significantly faster, while benchmark scores dropped only slightly compared to the full MHA setup.

The Trade-off

While MQA provides the maximum memory savings, forcing all heads to share a single KV representation can hurt model quality, especially for complex reasoning tasks. This quality concern led to the development of Grouped-Query Attention (next chapter).

Multi-Query Attention (MQA): All Heads Share One K and One V

32 Query Heads → 1 shared K + 1 shared V (32:1 mapping)

Q₁ Q₂ Q₃ … Q₃₂   →   K₁, V₁   ▲ Only these two vectors cached

1 K + 1 V = 0.5 KB/token/layer — a 32× reduction!

Models Using MQA

MQA was adopted by several notable models including PaLM (Google, 2022), Falcon-7B (TII, 2023), and StarCoder (BigCode, 2023). Note: Falcon-40B and Falcon-180B use GQA (8 KV heads), not MQA — only the 7B variant uses true single-head MQA. Concerns about quality degradation led most subsequent models to prefer GQA instead.

Chapter 07

Grouped-Query Attention (GQA) — The Best of Both Worlds

By mid-2023, the AI community faced a dilemma: MHA gave the best quality but was memory-hungry, while MQA saved memory but sacrificed accuracy. Joshua Ainslie and his team at Google asked the natural question: what if we do something in between?

Their answer was Grouped-Query Attention (GQA). Rather than giving each query head its own private KV pair (MHA) or forcing all query heads to share one KV pair (MQA), GQA organizes query heads into small groups. Each group gets its own dedicated KV pair. For instance, with 32 query heads split into 8 groups, every 4 query heads share one KV head.

Ainslie, J., Lee-Thorp, J., de Jong, M., et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. arXiv: 2305.13245
GQA KV Cache Size (with G groups)

KV Cache per token = 2 × n_layers × G × d_head × 2 bytes

where G = number of KV head groups (MHA: G = n_heads, MQA: G = 1)
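
The grouping itself is just integer arithmetic: query head q reads the KV head numbered q ÷ group_size. A sketch, using the 32-query, 8-group example from this chapter (the helper name is illustrative):

```python
def group_queries(n_q_heads, n_kv_heads):
    """Map each query head to the KV head it shares (GQA grouping).
    MHA is the special case n_kv_heads == n_q_heads; MQA is n_kv_heads == 1."""
    group_size = n_q_heads // n_kv_heads
    return [q // group_size for q in range(n_q_heads)]

# 32 query heads ÷ 8 KV heads -> groups of 4
mapping = group_queries(32, 8)
print(mapping[:6])    # -> [0, 0, 0, 0, 1, 1]
```

Setting n_kv_heads to 1 collapses the mapping to all zeros, which is exactly MQA.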

Key Findings from the Paper

The paper made two important contributions:

1. Uptraining Recipe: Existing MHA models can be converted to GQA using only 5% of original pre-training compute. This means you don't need to retrain from scratch.

2. Quality-Speed Trade-off: GQA with 8 KV heads achieves quality close to MHA while maintaining speed comparable to MQA. It's the Goldilocks solution.

Grouped-Query Attention (GQA): 4 Query Heads Share 1 K + 1 V Per Group

32 Query Heads ÷ 8 groups = 4 queries per group, each group sharing 1 K + 1 V

Group 1: Q₁ Q₂ Q₃ Q₄ → K₁ V₁
Group 2: Q₅ Q₆ Q₇ Q₈ → K₂ V₂
  ⋮
Group 8: Q₂₉ Q₃₀ Q₃₁ Q₃₂ → K₈ V₈

8 groups × (1 K + 1 V) = 16 vectors cached per layer — 4× less than MHA's 64.

GQA's Impact on the Industry

GQA quickly became the de facto standard for modern large language models. The following table shows its widespread adoption:

| Model | Year | Attention Type | Query Heads | KV Heads | KV Reduction vs MHA |
|---|---|---|---|---|---|
| LLaMA-2 7B/13B | 2023 | MHA | 32 / 40 | 32 / 40 | 1× (baseline) |
| LLaMA-2 70B | 2023 | GQA | 64 | 8 | 8× |
| LLaMA-3 8B/70B/405B | 2024 | GQA | 32 / 64 / 128 | 8 | 4× / 8× / 16× |
| Mistral 7B | 2023 | GQA | 32 | 8 | 4× |
| Qwen-2 | 2024 | GQA | varies | varies | 4–8× |
| Gemma 2 | 2024 | GQA | varies | varies | 4–8× |
Touvron, H., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv: 2307.09288, 2023.
Llama Team, AI @ Meta. "The Llama 3 Herd of Models." arXiv: 2407.21783, 2024.
Chapter 08

Multi-Head Latent Attention (MLA) — DeepSeek's Breakthrough

May 2024. While the rest of the industry was settling into GQA as "good enough," a Chinese AI lab called DeepSeek dropped a paper that made everyone do a double take. Their approach didn't just tweak the number of KV heads — it threw out the idea of storing Keys and Values at all. Instead, DeepSeek compressed the full K and V information for each token into a tiny "latent" vector, then reconstructed the actual Keys and Values on-the-fly during inference from this compressed seed.

DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv: 2405.04434, May 2024.

How MLA Works (The Intuition)

Think of it in terms of music. MHA is like storing full uncompressed WAV recordings of every instrument in an orchestra — pristine audio, but terabytes of disk space. GQA is like grouping instruments into sections (strings, brass, woodwinds) and keeping one recording per section. MLA goes further still: it doesn't store any audio at all. Instead, it stores a tiny MIDI-like encoding — just the notes, velocities, and timing — from which a synthesizer can reconstruct a faithful rendition of the full orchestra on demand. That MIDI encoding is the "latent vector," and the synthesizer is a learned matrix multiplication built into the model weights.

Mathematically, instead of storing the full Key matrix K (size: n_heads × d_head) and Value matrix V (size: n_heads × d_head) for each token, MLA stores a single compressed vector ct of dimension dc, where dc is much smaller than the full KV size.

MLA: Low-Rank Joint KV Compression

c_t = W_DKV × h_t     (compress hidden state to latent)
K_t = W_UK × c_t      (reconstruct Keys from latent)
V_t = W_UV × c_t      (reconstruct Values from latent)

Only c_t is stored in the cache. W_UK and W_UV are fixed model weights, not cached.
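
A toy sketch of those three equations with made-up small dimensions (the real DeepSeek-V2 uses d_c = 512; the weight matrices here are random placeholders, not trained parameters):

```python
import numpy as np

# illustrative sizes only
d_model, d_c, n_heads, d_head = 64, 8, 4, 16
rng = np.random.default_rng(0)
W_DKV = rng.normal(size=(d_c, d_model)) * 0.1            # down-projection (compress)
W_UK  = rng.normal(size=(n_heads * d_head, d_c)) * 0.1   # up-projection for Keys
W_UV  = rng.normal(size=(n_heads * d_head, d_c)) * 0.1   # up-projection for Values

h_t = rng.normal(size=(d_model,))   # hidden state for one token
c_t = W_DKV @ h_t                   # ONLY this latent (d_c numbers) is cached
K_t = W_UK @ c_t                    # Keys reconstructed on the fly
V_t = W_UV @ c_t                    # Values reconstructed on the fly

# cache stores d_c numbers instead of 2 × n_heads × d_head
print(d_c, 2 * n_heads * d_head)    # -> 8 128 (16× smaller in this toy)
```

In practice the up-projections can be folded into the query and output weights, so the reconstruction adds little extra compute at inference time.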

The Numbers Tell the Story

93.3% — KV cache reduction vs. DeepSeek-67B (MHA)

5.76× — maximum generation throughput improvement

70 KB — KV cache per token in DeepSeek-V3

DeepSeek-V2 MLA Configuration

In DeepSeek-V2, the specific parameters are a KV compression dimension d_c = 512 plus a decoupled RoPE key dimension d_hR = 64, against 128 attention heads of dimension 128.

The KV cache per token therefore requires only (d_c + d_hR) = 576 elements per layer, which is equivalent to GQA with only 2.25 KV groups. For comparison, standard MHA with 128 heads would need 2 × 128 × 128 = 32,768 elements per layer.

DeepSeek-V3: Scaling MLA Further

In December 2024, DeepSeek released V3, which continued using MLA and achieved even more impressive results:

DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv: 2412.19437, December 2024.
| Model | Attention Type | Total Params | KV Cache / Token | vs. LLaMA-3.1 405B |
|---|---|---|---|---|
| LLaMA-3.1 405B | GQA (8 heads) | 405B | 504 KB | 1× (baseline) |
| Qwen-2.5 72B | GQA | 72B | 320 KB | 0.63× |
| DeepSeek-V3 | MLA | 671B (37B active) | 69 KB | 0.14× (7.3× smaller) |

Why MLA Matters

MLA achieves better quality than MHA (not just comparable) while using far less KV cache memory. The DeepSeek-V2 paper showed MLA actually outperforms MHA on benchmarks, possibly because the compression acts as regularization. This killed the longstanding assumption that reducing KV cache always costs you quality.

Interactive Comparison

Live: How KV Cache Works in MHA vs MQA vs GQA vs MLA

The animation below runs all four attention methods side by side. Watch tokens arrive one at a time and see how much cached attention state each method stores. This is a normalized 32-layer FP16 comparison so the methods are compared on the same footing.

Chapter 09

Cross-Layer Attention (CLA) — Sharing Across Layers

All the methods discussed so far (MHA, MQA, GQA, MLA) reduce KV cache along the "head" dimension — fewer or compressed heads. But what about another dimension: layers?

In May 2024, researchers at MIT proposed Cross-Layer Attention (CLA): instead of each transformer layer maintaining its own KV cache, adjacent layers can share their Key-Value pairs.

Brandon, W., Mishra, M., Nrusimha, A., et al. "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention." arXiv: 2405.12981, May 2024.

The idea is simple: divide the transformer layers into groups (e.g., pairs of adjacent layers), and within each group, all layers use the Keys and Values from the bottom layer of the group. This means you only need to store KV cache for half the layers.
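
A sketch of the layer-to-cache mapping (the helper name is illustrative):

```python
def cla_kv_source(layer, share_factor=2):
    """Which layer's KV cache this layer reads under CLA: layers are grouped
    in blocks of `share_factor`, and every layer in a block reuses the KV
    of the block's bottom layer."""
    return (layer // share_factor) * share_factor

# 8 layers, adjacent pairs sharing: only layers 0, 2, 4, 6 store KV
sources = [cla_kv_source(l) for l in range(8)]
print(sources)            # -> [0, 0, 2, 2, 4, 4, 6, 6]
print(len(set(sources)))  # -> 4 caches instead of 8
```

Raising share_factor trades more cache savings for more cross-layer reuse.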

Results

CLA can reduce the KV cache size by 2× while maintaining nearly the same accuracy as standard MQA. Importantly, CLA is orthogonal to head-level compression — it can be combined with GQA or MQA for even greater savings. For example, GQA + CLA could achieve a combined 8–16× reduction over MHA.

Chapter 10

Sliding Window & Local-Global Attention

Another approach to reducing KV cache is limiting how far back the model looks. Instead of attending to every previous token (potentially hundreds of thousands), the model only looks at a fixed window of recent tokens.

Sliding Window Attention (Mistral 7B)

In October 2023, Mistral AI published the Mistral 7B model, which introduced Sliding Window Attention (SWA) with a window size of 4,096 tokens. Each layer only attends to the previous 4,096 tokens, meaning the KV cache per layer is capped at a fixed size regardless of the total sequence length.

Jiang, A.Q., et al. "Mistral 7B." arXiv: 2310.06825, October 2023.

The key insight is that information can still flow across the entire sequence through stacked layers. With a window of W = 4,096 and k layers stacked, the effective attention span is W × k. For Mistral 7B with 32 layers, this gives a theoretical span of approximately 131,000 tokens even though each layer only looks at 4,096.
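
The fixed cap is easy to see with a rolling buffer, which is how sliding-window caches are commonly implemented (sketch: a deque of token ids stands in for the per-layer K/V tensors):

```python
from collections import deque

# With window W, the per-layer cache never holds more than W entries,
# no matter how long the sequence gets.
W = 4096
cache = deque(maxlen=W)      # oldest entries fall off automatically
for t in range(10_000):
    cache.append(t)          # stand-in for caching token t's (K, V) pair
print(len(cache))            # -> 4096
print(cache[0])              # -> 5904 (tokens older than W are gone)
```

Evicted tokens are not truly forgotten: their information was already mixed into later tokens' representations by the layers below, which is what gives the W × k effective span.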

Interleaved Local-Global Attention (Gemma 2 and Gemma 3)

Google's Gemma models took a more nuanced approach by interleaving two types of attention layers:

| Model | Local Window | Global Layers | Local:Global Ratio | KV Cache Benefit |
|---|---|---|---|---|
| Gemma 2 | 4,096 tokens | Every other layer | 1:1 | ~50% of full attention cache |
| Gemma 3 | 1,024 tokens | Every 6th layer | 5:1 | Major reduction for 128K context |
| Gemma 4 (31B / E4B / E2B) | 1,024 (31B) / 512 (E4B, E2B) | Every 6th layer | 5:1 | 128K+ context; E4B shares KV across 18 layers; 31B does not share (num_kv_shared_layers=0) |
Gemma Team, Google. "Gemma 2: Improving Open Language Models at a Practical Size." arXiv: 2408.00118, 2024.
Gemma Team, Google. "Gemma 3 Technical Report." arXiv: 2503.19786, 2025.

Local layers use sliding window attention: they only look at nearby tokens, so their KV cache is small and fixed. Global layers use full attention: they look at all tokens, requiring a larger KV cache. By making most layers local (5:1 ratio in Gemma 3), the total KV cache is drastically smaller.

Gemma 3's Clever Design

By using a 5:1 local-to-global ratio with only 1,024-token local windows, Gemma 3 can handle 128K token contexts on accessible hardware. Only 1 out of every 6 layers needs to store full-length KV cache. The other 5 layers only store 1,024 tokens each — a fraction of the full sequence. This allows Gemma 3 27B to fit on a single GPU with 128K context.
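
A back-of-the-envelope sketch of why the interleaving helps, counting cached token slots per layer (hypothetical helper; head counts and head dims are ignored for clarity):

```python
def local_global_cache_tokens(n_layers, ratio_local, window, seq_len):
    """Cached token slots summed over layers for an interleaved stack:
    every (ratio_local + 1)-th layer is global (full context), the rest
    are local and capped at `window` tokens."""
    n_global = n_layers // (ratio_local + 1)
    n_local = n_layers - n_global
    return n_local * min(window, seq_len) + n_global * seq_len

# 30 layers at a Gemma-3-style 5:1 ratio, 1,024-token windows, 128K context
full = 30 * 131072                       # every layer caches the full sequence
mixed = local_global_cache_tokens(30, 5, 1024, 131072)
print(mixed, round(mixed / full, 3))     # -> 680960 0.173
```

At long contexts the global layers dominate the total, which matches the sub-linear growth noted for Gemma's cache.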

Chapter 11

PagedAttention — Virtual Memory for AI

Everything we've discussed so far makes the KV cache theoretically smaller. But there's a different, more mundane problem: even when the cache should fit in GPU memory, the software managing that memory does an awful job. Chunks get allocated, freed, re-allocated in different sizes — and before long, the GPU's memory looks like a parking lot full of oddly-spaced cars with no room for a new one.

In September 2023, Woosuk Kwon and colleagues at UC Berkeley published a fix that was hiding in plain sight for decades — in your computer's operating system, of all places. The idea: treat GPU memory for KV cache the same way Linux treats RAM. Break it into fixed-size pages. Don't require them to be contiguous. Map them with a lightweight table. They called it PagedAttention.

Kwon, W., Li, Z., Zhuang, S., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv: 2309.06180

The Problem: Memory Fragmentation

In traditional LLM serving, the system pre-allocates a contiguous block of memory for each request's KV cache based on the maximum possible sequence length. This leads to three types of waste:

1. Internal Fragmentation — pre-allocated blocks are larger than actually needed.

2. External Fragmentation — free memory exists but in scattered, unusable chunks.

3. Reservation Waste — memory reserved for future tokens that may never be generated.

The paper found that existing systems waste 60–80% of KV cache memory!

The Solution: Paging

PagedAttention divides the KV cache into fixed-size blocks (like pages in virtual memory). These blocks don't need to be contiguous in physical GPU memory — they can be scattered anywhere, connected by a simple lookup table. This eliminates fragmentation almost entirely.

Request arrives
Allocate KV blocks on demand
Blocks stored anywhere in GPU memory
Page table maps logical → physical
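
A toy sketch of the block-table idea (names are illustrative; real vLLM tracks far more state, such as reference counts for prefix sharing):

```python
class PagedKVCache:
    """Page-table sketch: a request's logical token positions map to
    scattered physical blocks; nothing needs to be contiguous."""
    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))   # free-list of physical blocks
        self.tables = {}                    # request -> page table (logical -> physical)
        self.lengths = {}                   # request -> tokens written so far

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:        # current block full: grab a new one on demand
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def physical_location(self, req, pos):
        block = self.tables[req][pos // self.block_size]
        return block, pos % self.block_size

cache = PagedKVCache(n_blocks=100)
for _ in range(40):                         # request "a" writes 40 tokens
    cache.append_token("a")
print(len(cache.tables["a"]))               # -> 3 blocks (ceil(40/16))
print(cache.physical_location("a", 20))     # -> (98, 4)
```

Because blocks are allocated only as tokens actually arrive, the only waste is the unused tail of the final block — at most block_size − 1 slots per request.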

Results

The resulting system, called vLLM, posted clear gains:

2–4× — throughput improvement vs. FasterTransformer and Orca

~0% — memory waste, down from 60–80% in traditional systems

Free KV cache sharing — multiple requests can share common prefixes

vLLM became one of the most widely adopted open-source LLM serving frameworks. PagedAttention's approach to KV cache memory management has been adopted by several other serving systems including TensorRT-LLM and SGLang.

Chapter 12

KV Cache Eviction — Keeping Only What Matters

Another powerful strategy: instead of caching Keys and Values for every token, only keep the ones that actually matter. Research has shown that attention is highly concentrated — a small fraction of tokens receive the vast majority of attention.

H2O: Heavy-Hitter Oracle (NeurIPS 2023)

Zhang et al. analyzed how attention is distributed across tokens and found a clear pattern: the vast majority of the total attention score is concentrated on a tiny fraction of tokens — what they call "heavy hitters." Most tokens contribute almost nothing to the final output. Based on this observation, their H2O system retains only these high-impact tokens alongside a window of recent tokens in the KV cache, discarding everything else.

Zhang, Z., Sheng, Y., et al. "H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." NeurIPS 2023. arXiv: 2306.14048

With only 20% heavy hitters retained, H2O improved throughput by up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate on OPT-6.7B and OPT-30B models.
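
A sketch of the selection policy, assuming we already have per-step attention weights (the function is illustrative, not the paper's exact algorithm):

```python
import numpy as np

def h2o_keep_mask(attn_scores, n_heavy, n_recent):
    """H2O-style eviction sketch: keep the n_heavy tokens with the largest
    accumulated attention ("heavy hitters") plus the n_recent newest tokens.
    `attn_scores` is a (steps, seq_len) matrix of attention weights."""
    seq_len = attn_scores.shape[1]
    accumulated = attn_scores.sum(axis=0)            # total attention each token received
    heavy = set(np.argsort(accumulated)[-n_heavy:].tolist())
    recent = set(range(seq_len - n_recent, seq_len))
    return sorted(heavy | recent)                    # positions whose K/V stay cached

rng = np.random.default_rng(0)
scores = rng.random((8, 20))
scores[:, 3] += 10.0                   # token 3 is a heavy hitter by construction
kept = h2o_keep_mask(scores, n_heavy=2, n_recent=4)
print(3 in kept, len(kept))            # token 3 survives; only ~6 of 20 tokens cached
```

Everything outside the kept set is evicted from the cache, and in H2O the accumulated scores are updated greedily as generation proceeds.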

Scissorhands (NeurIPS 2023)

Liu et al. proposed the "Persistence of Importance" hypothesis: tokens that were important at one generation step will continue to be important in future steps. This insight allows the system to identify important tokens early and keep only those.

Liu, Z., Desai, A., et al. "Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time." NeurIPS 2023. arXiv: 2305.17118

Scissorhands achieves up to 5× KV cache reduction without quality loss, and when combined with 4-bit quantization, reaches 20× compression.

StreamingLLM: Infinite Context with Fixed Cache (ICLR 2024)

Xiao et al. at MIT uncovered a curious behavior in transformer models: no matter what text you feed them, the very first few tokens in the sequence always attract unusually high attention scores — even when those tokens carry no meaningful content (like a period or a padding symbol). The researchers named these positions "attention sinks." By preserving just these initial sink tokens alongside a sliding window of the most recent tokens, the model can process infinitely long sequences with a fixed memory budget.

Xiao, G., Tian, Y., Chen, B., et al. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv: 2309.17453

StreamingLLM enabled models like Llama-2, MPT, and Falcon to process up to 4 million tokens with stable performance, achieving 22.2× speedup over sliding window recomputation baselines.
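
The eviction policy itself is tiny. A sketch, with the sink and window sizes chosen as assumptions for illustration:

```python
def streaming_cache_positions(seq_len, n_sinks=4, window=1020):
    """StreamingLLM policy sketch: keep the first n_sinks 'attention sink'
    tokens plus the most recent `window` tokens; total cache size is fixed."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))
    return list(range(n_sinks)) + list(range(seq_len - window, seq_len))

kept = streaming_cache_positions(1_000_000)
print(len(kept))        # -> 1024, no matter how long the stream gets
print(kept[:5])         # -> [0, 1, 2, 3, 998980]
```

Dropping the sink tokens instead of the middle ones is what breaks naive sliding windows: the model has learned to dump "spare" attention mass on those early positions, so they must stay.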

SnapKV (NeurIPS 2024)

Li et al. noticed something surprising: even before the model starts generating new tokens, the pattern of which prompt tokens each attention head focuses on is already stable and predictable. By examining attention weights in a small window near the end of the prompt, SnapKV can anticipate which earlier tokens will matter most and pre-select them for caching, discarding the rest.

Li, Y., et al. "SnapKV: LLM Knows What You are Looking for Before Generation." NeurIPS 2024. arXiv: 2404.14469
SnapKV's headline numbers: 3.6× faster generation at 16K tokens, 8.2× memory savings compared to a full KV cache, and up to 380K tokens of context on a single A100-80GB GPU.
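The selection idea can be sketched as a toy scorer (illustrative only; the real SnapKV operates per attention head with pooling, not on a single aggregate matrix):

```python
# Toy sketch of SnapKV-style selection: score each prompt position by
# the attention mass it receives from a small "observation window" at
# the end of the prompt, then cache only the top-k scoring positions
# (plus the window itself).

def snapkv_select(attn, window=2, keep=3):
    # attn[i][j]: attention from query position i to prompt position j
    n = len(attn[0])
    obs_rows = attn[-window:]                      # last `window` queries
    scores = [sum(row[j] for row in obs_rows) for j in range(n - window)]
    topk = sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]
    return sorted(topk) + list(range(n - window, n))

# 6-token prompt; the last 2 queries mostly attend to positions 0 and 3.
attn = [
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.5, 0.1, 0.1],
    [0.4, 0.0, 0.1, 0.3, 0.1, 0.1],   # observation window row
    [0.3, 0.1, 0.0, 0.4, 0.1, 0.1],   # observation window row
]
print(snapkv_select(attn, window=2, keep=2))   # [0, 3, 4, 5]
```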

PyramidInfer and PyramidKV

Two complementary papers discovered that the number of important tokens decreases across layers. Lower layers need more cached tokens, while higher layers can function with very few. This pyramidal pattern allows allocating more cache budget to lower layers and less to higher layers.

Yang, D., et al. "PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference." arXiv: 2405.12532, 2024.
Cai, Z., et al. "PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling." arXiv: 2406.02069, 2024.

PyramidInfer achieved 2.2× throughput improvement with over 54% GPU memory reduction. PyramidKV matched full KV cache performance while retaining only 12% of the cache.
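A hypothetical budget schedule illustrating the pyramidal idea (not the papers' actual allocation algorithms; the linear decay and the `min_frac` parameter are assumptions for the sketch):

```python
# Pyramidal KV budget sketch: lower layers get larger cache budgets,
# higher layers smaller ones, while the total budget stays fixed.

def pyramid_budgets(n_layers, total_budget, min_frac=0.2):
    # Linearly decaying weights from 1.0 (layer 0) down to min_frac (top layer).
    weights = [1.0 - (1.0 - min_frac) * i / (n_layers - 1)
               for i in range(n_layers)]
    scale = total_budget / sum(weights)
    return [round(w * scale) for w in weights]

budgets = pyramid_budgets(n_layers=8, total_budget=8000)
print(budgets)   # largest budget first, smallest last
```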

Chapter 13

KV Cache Quantization — Shrinking Each Number

Every approach so far has tried to cache fewer things. Quantization takes a completely different angle: cache the same things, but represent each number using fewer bits. It's the difference between storing fewer photos and compressing each photo. In standard deployments, each number in the KV cache occupies 16 bits (FP16). The question quantization researchers ask: can we get away with 8, 4, or even 2 bits?

What is Quantization?

Think about how you describe someone's height. With FP16 precision, you'd say "175.38 cm" — exact to the fraction of a millimeter. With INT8, you'd round to "175 cm." With INT4, maybe "tall" or "short" with a few gradations. With INT2, you get just four buckets: "very short," "short," "tall," "very tall." Each step saves storage space but throws away some nuance. The gamble with KV cache quantization: can the model still reason accurately when its cached memories are stored as rough sketches instead of precise measurements?

Animation: KV Cache Quantization — FP16 → INT8 → INT4 → INT2

Illustrative scalar view of KV quantization: the values are rounded more aggressively as bit width drops. Real systems such as KIVI and TurboQuant use more sophisticated schemes than this simple bar demo.
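Here is a minimal scalar quantize/dequantize round-trip showing how error grows as the bit width shrinks (illustrative only; real schemes like KIVI and KVQuant use per-channel/per-token scales, zero-points, and non-uniform levels):

```python
# Illustrative symmetric scalar quantizer: round each value to the
# nearest representable level for a given bit width, then map back.
# This only demonstrates the rounding effect; it is not any paper's scheme.

def quantize_dequantize(values, bits):
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for INT8, 7 for INT4, 1 for INT2
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

cache_slice = [0.812, -1.337, 0.054, 2.250, -0.479]
for bits in (8, 4, 2):
    approx = quantize_dequantize(cache_slice, bits)
    max_err = max(abs(a - v) for a, v in zip(approx, cache_slice))
    print(f"INT{bits}: max abs error = {max_err:.4f}")
```

As expected, the maximum reconstruction error grows as bits drop: INT8 is nearly exact, INT2 collapses values into a handful of coarse buckets.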

KIVI: 2-Bit Quantization (ICML 2024)

Liu et al. stumbled onto something that previous quantization work had missed. When you look at the raw numbers inside Keys and Values, they have different shapes of outliers. The Key matrix has certain channels (columns) that consistently contain huge values across all tokens — so you need to calibrate each channel separately. The Value matrix is the opposite: certain tokens (rows) spike across all channels. KIVI exploits this asymmetry by using per-channel quantization for Keys and per-token quantization for Values, squeezing both down to just 2 bits per number — a raw 8× compression from FP16.

Liu, Z., Yuan, J., Jin, H., et al. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. arXiv: 2402.02750
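The per-channel vs per-token distinction can be sketched with plain lists (an illustrative sketch of the idea, not KIVI's implementation):

```python
# Sketch of KIVI's asymmetry: Keys get one quantization scale per
# channel (column), Values one scale per token (row). Matrices are
# nested lists with shape [tokens][channels].

def scales(matrix, per):
    if per == "channel":   # one max-abs scale per column (Keys)
        return [max(abs(x) for x in col) for col in zip(*matrix)]
    if per == "token":     # one max-abs scale per row (Values)
        return [max(abs(x) for x in row) for row in matrix]

K = [[0.1, 9.0, 0.2],     # channel 1 is a persistent outlier channel
     [0.3, 8.5, 0.1]]
V = [[0.2, 0.1, 0.3],
     [7.0, 6.5, 8.0]]     # token 1 is an outlier token

print(scales(K, "channel"))   # one scale per channel: [0.3, 9.0, 0.2]
print(scales(V, "token"))     # one scale per token:   [0.3, 8.0]
```

Calibrating along the axis where the outliers live is what lets KIVI push both matrices down to 2 bits without wrecking accuracy.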
KIVI's headline numbers: 2.6× peak memory reduction (including model weights), up to 4× larger batch sizes (more users served simultaneously), and 2.35–3.47× throughput improvement on real LLM workloads.

Across LLaMA, Falcon, and Mistral model families, KIVI preserved benchmark accuracy with negligible degradation — all without any retraining or fine-tuning step.

KVQuant: Toward 10 Million Context (NeurIPS 2024)

Hooper et al. pushed quantization further with several innovations: per-channel key quantization, pre-RoPE key quantization, non-uniform quantization, and per-vector dense-and-sparse quantization.

Hooper, C., Kim, S., et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." NeurIPS 2024. arXiv: 2401.18079

The results enabled serving LLaMA-7B with a context length of up to 1 million tokens on a single A100-80GB GPU, and up to 10 million tokens across an 8-GPU system, with less than 0.1 point of perplexity degradation at 3-bit precision.

TurboQuant: Near-Optimal Compression (2025)

Google's TurboQuant represents the cutting edge of KV cache quantization. Unlike previous methods that required careful calibration data or model-specific tuning, TurboQuant is completely data-oblivious — it works on any model without seeing any data.

Google Research. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." arXiv: 2504.19874, 2025.

The method works by:

  1. Randomly rotating input vectors to distribute information evenly across coordinates
  2. Applying optimal scalar quantizers to each coordinate independently
  3. Using a 1-bit Quantized JL transform on residuals for unbiased inner products
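Step 1 can be demonstrated in a few lines of NumPy: rotating a spiky vector by a random orthogonal matrix preserves its norm while spreading its energy across coordinates (illustrative; TurboQuant uses structured fast rotations, not a dense QR-derived matrix):

```python
import numpy as np

# A random rotation spreads a spiky vector's energy evenly across
# coordinates, which makes per-coordinate scalar quantization far more
# effective. Illustrative only.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # random orthogonal matrix

x = np.zeros(64)
x[0] = 10.0          # all energy in one coordinate
y = Q @ x            # rotated vector: same L2 norm, energy spread out

print(np.linalg.norm(x), np.linalg.norm(y))   # norms match
print(np.abs(x).max(), np.abs(y).max())       # peak coordinate shrinks a lot
```

Because the rotation is orthogonal it is exactly invertible, so nothing is lost before quantization even begins.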
TurboQuant's headline numbers: quality-neutral at 3.5 bits, only marginal (nearly imperceptible) degradation at 2.5 bits, and a 4–7× compression ratio versus the FP16 baseline.

Bonsai 8B + Turbo1Bit: 1-Bit Weights Meet KV Cache Compression

All the quantization methods above shrink the KV cache. But what happens when you also shrink the model weights to the extreme? In mid-2025, Caltech spinoff PrismML released Bonsai 8B — a language model where every single weight is represented as just one bit: either +1 or −1. An 8.2 billion parameter model that fits in 1.15 GB (compared to ~16 GB for a standard FP16 8B model). That's a 14× size reduction for the weights alone.

PrismML / Caltech. "Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs." prismml.com, 2025. Whitepaper: GitHub

But here's the catch that relates to our KV cache story: 1-bit weights do NOT mean 1-bit KV cache. The weight matrices (W_K, W_V, W_Q, etc.) are 1-bit, but when they multiply against the input activations, the resulting Key and Value vectors are still full-precision floating point numbers. So even though Bonsai's weights are tiny, the KV cache during inference remains FP16-sized and grows linearly with context length — exactly the same memory wall problem.

This is where Turbo1Bit comes in: an open-source project that stacks TurboQuant-style KV cache quantization on top of Bonsai's 1-bit weights.

Turbo1Bit. "Combining 1-bit LLM weights (Bonsai) with TurboQuant KV cache compression." GitHub, 2025.
| Config (Bonsai 8B @ 65K context) | Model Weights | KV Cache | Total Memory | Reduction |
|---|---|---|---|---|
| Standard FP16 8B model | ~16 GB | ~8 GB | ~24 GB | 1× baseline |
| Bonsai 1-bit (no KV compression) | 1.15 GB | ~8 GB (FP16) | 10.6 GB | 2.3× |
| Bonsai 1-bit + Turbo1Bit (Q4_0 KV) | 1.15 GB | ~2 GB | 4.0 GB | 6× |

Turbo1Bit's KV cache compression findings for 1-bit models specifically:

| KV Quantization | Perplexity (WikiText-2) | vs Baseline | Memory Saving |
|---|---|---|---|
| FP16 (baseline) | 25.51 | — | 1× |
| Q8_0 (8-bit) | 25.49 | −0.1% | 1.75× |
| Q5_0 (5-bit) | 25.87 | +1.4% | 2.5× |
| Q4_0 (4-bit) | 26.82 | +5.1% | 2.91× |

The 1-Bit Insight: KV Cache Becomes the Dominant Cost

Bonsai reveals a striking reality about the future of efficient models. When you compress model weights to 1-bit (1.15 GB for an 8B model), the KV cache becomes the overwhelming majority of memory usage. At 65K context with FP16 KV cache, the cache is 7× larger than the model itself. This flips the traditional assumption that model weights dominate memory. For 1-bit models, KV cache optimization isn't optional — it's essential. This is why combining Bonsai with TurboQuant-style KV compression yields such dramatic results: total memory drops from 10.6 GB to 4 GB.

Expert Take: 1-Bit Only Matters If the KV Cache Falls Too

Bonsai 8B is a useful sanity check for experts because it cleanly separates weight compression from runtime memory. The weights are tiny at about 1.1 GB, but the local benchmark logs in PrismML/Bonsai-demo still show 4,608 MiB of FP16 KV cache at 32K context. In practice, that means the bottleneck has not disappeared — it has simply moved from the model file to the cache.

The more interesting deployment story is Bonsai 8B plus TurboQuant-style Q4_0 KV compression. In the same local logs, the KV cache falls from 4,608 MiB to 1,296 MiB at 32K, and total resident memory drops from 6,011 MiB to 2,699 MiB. At 60K context, the compressed run still lands at about 3,782 MiB total. That is the point where 1-bit weights stop being a neat packaging trick and start looking like a serious long-context systems design.

Proof points: the local benchmark artifacts in PrismML/Bonsai-demo/memory_benchmark_results.json and PrismML/Bonsai-demo/turboquant_kv_results.json record Bonsai-8B at 32K with 4,608 MiB FP16 KV vs 1,296 MiB Q4 KV, and total resident memory of 6,011 MiB vs 2,699 MiB. The companion Bonsai materials identify the 8B GGUF weights at roughly 1,099 MiB. Public references: Bonsai-demo, Turbo1Bit GitHub.
Bonsai's headline numbers: a 1.15 GB model size (vs ~16 GB for an FP16 8B model), a 70.5 average benchmark score (beating LLaMA 3.1 8B's 67.1), and 44 tok/s running the full 8B model on an iPhone 17 Pro.

Quantization Comparison Overview

| Method | Year | Bits (K/V) | Memory Reduction | Needs Calibration? | Quality Impact |
|---|---|---|---|---|---|
| FP16 (Baseline) | — | 16 / 16 | N/A | — | Baseline |
| INT8 Quantization | Various | 8 / 8 | 2× | Minimal | Negligible |
| KIVI | 2024 | 2 / 2 | ~8× | No (tuning-free) | Near-zero loss |
| KVQuant | 2024 | 3 / 3 | ~5× | Minimal | <0.1 perplexity |
| TurboQuant | 2025 | 2.5–3.5 | 4–7× | No (data-oblivious) | Near-zero loss |
Chapter 14

FlashAttention — Making Hardware Work Smarter

FlashAttention is not a KV cache compression technique. But it changed how attention (and the KV cache) is read from GPU memory, making everything faster without touching the model architecture itself.

FlashAttention (NeurIPS 2022)

Tri Dao et al. observed that standard attention implementations waste enormous time moving data between GPU memory (HBM) and the fast on-chip memory (SRAM). FlashAttention uses a technique called "tiling": instead of computing the entire attention matrix at once (which requires materializing an enormous n×n matrix), it processes the attention in small blocks that fit in the fast on-chip SRAM.

Dao, T., Fu, D.Y., Ermon, S., et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv: 2205.14135

Key result: 2–4× speedup over optimized baselines, with linear memory usage instead of quadratic. This means the n×n attention matrix is never fully materialized in GPU memory.
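The tiling idea can be sketched as a single-head NumPy implementation of the online-softmax recurrence (illustrative; the real kernels fuse these steps into SRAM-resident tiles and never touch Python):

```python
import numpy as np

# Minimal single-head sketch of FlashAttention-style tiling: process
# K/V in blocks, keeping a running row-max and normalizer so the full
# n×n score matrix is never materialized.

def attention_reference(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row max (for numerical stability)
    l = np.zeros((n, 1))           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                      # only n × block scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)                 # rescale old statistics
        P = np.exp(S - m_new)
        l = l * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ Vb
        m = m_new
    return out / l

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16))
K = rng.standard_normal((8, 16))
V = rng.standard_normal((8, 16))
print(np.allclose(attention_tiled(Q, K, V), attention_reference(Q, K, V)))  # True
```

The output matches the reference exactly (up to floating point): FlashAttention is exact attention, just computed in an IO-friendlier order.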

FlashAttention-2 (2023)

Dao improved upon the original with better parallelism and work partitioning across GPU thread blocks.

Dao, T. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv: 2307.08691, 2023.
FlashAttention-2's headline numbers: 1.7–3× faster than FlashAttention-1, 3–10× faster than standard attention, and 72% model FLOP utilization (225 TFLOPs/s on an A100).

FlashAttention is now integrated into most major LLM frameworks (vLLM, Hugging Face Transformers, PyTorch) and is widely used for reducing memory overhead and improving throughput during both training and inference.

Chapter 15

Model-by-Model Comparison — The Complete Picture

Let's bring everything together. The table below covers every major open model family from the original baseline through to mid-2026, showing exactly how each one handles the KV cache challenge.

Part A: Legacy & Foundational Models (2023–2024)

| Model | Year | Params | Attention | Q Heads | KV Heads | Context | KV/Token (FP16) |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2023 | 7B | MHA | 32 | 32 | 4K | 512 KB |
| LLaMA-2 70B | 2023 | 70B | GQA | 64 | 8 | 4K | 320 KB |
| Mistral 7B | 2023 | 7B | GQA+SWA | 32 | 8 | 32K (4K window) | 128 KB |
| Gemma 1 7B | 2024 | 7B | MHA | 16 | 16 | 8K | 448 KB |
| LLaMA-3 8B | 2024 | 8B | GQA | 32 | 8 | 128K | 128 KB |
| LLaMA-3 405B | 2024 | 405B | GQA | 128 | 8 | 128K | 504 KB |
| Qwen-2 72B | 2024 | 72B | GQA | 64 | 8 | 128K | 320 KB |
| Gemma 2 27B | 2024 | 27B | GQA+LG | 32 | 16 | 8K (4K local) | Reduced (~50%) |
| Phi-4 14B | 2024 | 14B | GQA | 24 | 8 | 16K | 128 KB |
| DeepSeek-V2 | 2024 | 236B (21B active) | MLA | 128 | Latent (d_c=512) | 128K | ~Eq. to 2.25 GQA groups |
| DeepSeek-V3 | 2024 | 671B (37B active) | MLA+MoE | 128 | Latent (d_c=512) | 128K | 70 KB |

Part B: Frontier Models (2025–2026)

| Model | Year | Params | Attention | Q Heads | KV Heads | Context | Key Innovation |
|---|---|---|---|---|---|---|---|
| Gemma 3 27B | 2025 | 27B | GQA+LG | 32 | Varies | 128K (1K local) | 5:1 local-global ratio; only 1/6 layers full-attention |
| Qwen3 0.6B | 2025 | 0.6B | GQA | 16 | 8 | 32K | QK-Norm; no QKV-bias; tied embeddings |
| Qwen3 4B | 2025 | 4B | GQA | 32 | 8 | 32K | 36 layers; 4:1 Q-to-KV ratio |
| Qwen3 8B | 2025 | 8.2B | GQA | 32 | 8 | 128K | Thinking mode (hybrid reasoning) |
| Qwen3 32B | 2025 | 32.8B | GQA | 64 | 8 | 128K | 64 layers; 8:1 Q-to-KV ratio |
| Qwen3 235B-A22B | 2025 | 235B (22B active) | GQA+MoE | 64 | 4 | 128K | 128 experts × 8 active; 16:1 Q-to-KV |
| Qwen3.5 4B | 2025 | 4B | GQA | — | — | 128K | Latest Qwen iteration; improved reasoning |
| LLaMA 4 Scout | 2025 | 109B (17B active) | LongCtx+MoE | — | — | 10M | Official model card lists a 10M-token context window; detailed KV internals are not broken out in this survey |
| LLaMA 4 Maverick | 2025 | 400B (17B active) | LongCtx+MoE | — | — | 1M | Long-context multimodal MoE model; detailed KV internals not broken out here |
| Kimi K2 | 2025 | 1.04T (32B active) | MLA+MoE | 64 | Latent (MLA) | 128K | 384 experts (8 active); MuonClip optimizer; 15.5T tokens |
| Gemma 4 E2B | 2026 | ~5.1B (2.3B eff.) | GQA+LG | 8 | 1 | 128K+ | PLE; KV sharing across 20/35 layers (num_kv_shared_layers=20) |
| Gemma 4 E4B | 2026 | ~8B (4B eff.) | GQA+LG | 8 | 2 | 128K+ | PLE; 42 layers; head_dim 256/512; sliding_window=512; num_kv_shared_layers=18 |
| Gemma 4 26B-A4B | 2026 | 26B (3.8B active) | GQA+LG+MoE | — | 8 local / 2 global | 128K+ | 128 experts (8 active) + 3× shared expert |
| Bonsai 8B | 2025 | 8.2B (1-bit weights) | GQA+1-bit | — | — | 128K | 1.15 GB model; KV cache stays FP16 (weights are 1-bit, cache is not); +Turbo1Bit for Q4 KV |
| Gemma 4 31B | 2026 | 31B (dense) | GQA+LG | — | 16 (all layers) | 128K+ | 5:1 local-global; p-RoPE; unified K/V in global layers |
Qwen Team. "Qwen3 Technical Report." arXiv: 2505.09388, 2025.
Kimi Team, Moonshot AI. "Kimi K2: Open Agentic Intelligence." arXiv: 2507.20534, 2025.
Gemma Team, Google. "Gemma: Open Models Based on Gemini Research and Technology." arXiv: 2403.08295, 2024.
Qwen Team. "Qwen2 Technical Report." arXiv: 2407.10671, 2024.
Microsoft. "Phi-4 Technical Report." arXiv: 2412.08905, 2024.

Exact KV Cache Memory Requirements — Calculated from config.json

Below are KV cache memory calculations derived from published config.json files where available. Rows marked with * are estimates due to hybrid architectures (mixed head dimensions, KV sharing, or sliding windows) where an exact single number is not straightforward.

| Model | Layers | KV Heads | Head Dim | KV/Token | @ 1K Tokens | @ 128K Tokens | Calculation Notes |
|---|---|---|---|---|---|---|---|
| LLaMA-1 65B (MHA) | 80 | 64 | 128 | 2,560 KB | 2.5 GB | 320 GB | 2×80×64×128×2 = 2,621,440 B. Pure MHA: every head has its own K+V. Source: HF config. |
| Falcon-7B (MQA) | 32 | 1 | 64 | 8 KB | 8 MB | 1.0 GB | 2×32×1×64×2 = 8,192 B. All 71 query heads share 1 K + 1 V (multi_query=true). Source: HF config. |
| Mistral 7B (GQA+SWA) | 32 | 8 | 128 | 128 KB | 128 MB | 128 MB* | 2×32×8×128×2 = 131,072 B. *Sliding window caps the cache at 4K tokens regardless of context. |
| Qwen3 235B (GQA-4+MoE) | 94 | 4 | 128 | 188 KB | 188 MB | 24 GB | 2×94×4×128×2 = 192,512 B. Only 4 KV heads for 64 query heads (16:1 ratio). |
| DeepSeek-V3 (MLA+MoE) | 61 | Latent (d_c=512) | 576* | 69 KB | 69 MB | 8.6 GB | 61×(512+64)×2 = 70,272 B. Stores the compressed latent, not raw K+V. *576 = d_c + d_R. |
| Kimi K2 (MLA+MoE) | 61 | Latent (d_c=512) | 576* | ~69 KB | ~69 MB | ~8.6 GB | Same MLA config as DeepSeek-V3: kv_lora_rank=512, qk_rope_head_dim=64. 384 experts (vs 256 in DSV3). |
| Gemma 4 31B (GQA+LG) | 60 (50 local + 10 global) | 16 (all layers) | 256 / 512 | ~1,120 KB naive | ~1.1 GB | ~41 GB† | †50 sliding layers (16 KV, d=256, capped at 1,024 tok) + 10 global layers (16 KV, d=512, full ctx). At 128K, global layers dominate: ~41 GB total. At 1K: ~1.1 GB. Source: official config.json. |
| Qwen3.5 9B (GQA+DeltaNet) | 32 (8 attn) | 4 | 256 | 32 KB | 32 MB | 4 GB | 2×8×4×256×2 = 32,768 B. Only 8/32 layers use attention; 24 use DeltaNet (no KV cache). |
Qwen3.5-9B, Gemma 4 E4B, and Kimi K2 parameters are sourced from official Hugging Face config/model pages (Qwen3.5-9B, Gemma 4 E4B, Kimi K2). LLaMA 4 Scout long-context claims are sourced from Meta's official model card; detailed attention internals are not treated as independently verified here.

The Standout: Qwen3.5 9B Uses Only 32 KB Per Token

Qwen3.5-9B achieves remarkably low KV cache by using a hybrid DeltaNet + Attention architecture. Out of 32 layers, only 8 use traditional attention (which requires KV cache). The remaining 24 layers use Gated DeltaNet — a form of linear attention that maintains a fixed-size recurrent state instead of a growing cache. This means 75% of the model's layers add zero KV cache overhead regardless of context length. Combined with just 4 KV heads and head_dim=256, the result is just 32 KB per token — an astounding 16× less than the LLaMA-2 7B baseline of similar parameter count. At 128K context, that's only 4 GB for the entire KV cache.
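Every per-token figure in the table comes from the same formula: 2 (K and V) × layers × KV heads × head dim × 2 bytes (FP16). A quick check:

```python
# Per-token KV cache size for standard (non-MLA) attention, matching
# the calculation notes in the table above.

def kv_per_token_kb(layers, kv_heads, head_dim, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_val / 1024

print(kv_per_token_kb(32, 32, 128))   # LLaMA-2 7B (MHA):  512.0 KB
print(kv_per_token_kb(32, 1, 64))     # Falcon-7B (MQA):     8.0 KB
print(kv_per_token_kb(8, 4, 256))     # Qwen3.5 9B (only its 8 attention layers): 32.0 KB
```

MLA models break this formula on purpose: they store one compressed latent per layer (layers × (d_c + d_R) × 2 bytes), which is why DeepSeek-V3 lands at ~69 KB despite its 128 query heads.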

Gemma 4 E4B: Aggressive KV Sharing Across Layers

Gemma 4 E4B uses just 2 KV heads and shares KV cache across 18 of its 42 layers (num_kv_shared_layers=18). This means only 24 layers maintain unique KV caches. Sliding window layers are capped at 512 tokens regardless of context length. The result: a 4B-effective model that can handle 128K context with roughly 2 GB of KV cache, putting it in the mobile/edge-friendly range.

Expert Take: Gemma 4 E4B Is What Edge-Friendly Long Context Looks Like

Gemma 4 E4B is compelling because the base architecture already puts 128K context in roughly the 2 GB KV-cache range. That means the model is not relying on a heroic post-training trick to become usable on constrained hardware; it gets there by design through a very low KV-head count, cross-layer sharing, and local-global attention.

In our local turboquant-llamacpp measurements on Gemma 4 E4B running on an M2 Pro, TurboQuant-style 4-bit KV reduced cache memory from 104 MiB to 29 MiB at 4K, 296 MiB to 83 MiB at 16K, and 552 MiB to 155 MiB at 32K, staying near a 3.56× reduction throughout. If that same ratio holds at 128K, a ~2.1 GB FP16 KV budget lands around 600 MiB — which is exactly why Gemma-class models are so interesting for on-device long-context inference.

Proof points: Google's official Gemma 4 E4B model card lists 42 layers, a 512-token sliding window, and 128K context. The local TurboQuant measurements cited here come from the public turboquant-llamacpp README: 104 MiB → 29 MiB (4K), 296 MiB → 83 MiB (16K), and 552 MiB → 155 MiB (32K) on an M2 Pro. Sources: Gemma 4 E4B model card, turboquant-llamacpp README.

Kimi K2: Trillion Parameters, Same KV Cache as DeepSeek-V3

Despite scaling from 671B (DeepSeek-V3) to 1.04 trillion parameters (Kimi K2), the KV cache stays identical at ~70 KB per token. Both use MLA with kv_lora_rank=512 and qk_rope_head_dim=64 across 61 layers. The entire parameter increase went into more experts (384 vs 256) and expert capacity — not into attention overhead. This proves MLA's KV cost is decoupled from model scale.

What the Table Reveals

Three Distinct KV Cache Strategies Have Emerged

1. The GQA Camp (LLaMA, Qwen, Gemma, Phi, Mistral): Nearly every major model family standardized on GQA with 8 KV heads. The Qwen3 235B MoE pushes this further to just 4 KV heads — a 16:1 compression ratio. Simple, proven, universally adopted.

2. The MLA Camp (DeepSeek, Kimi K2): These models compress KV into tiny latent vectors, achieving the lowest per-token cache cost. Kimi K2 took DeepSeek's MLA and scaled it to 1 trillion parameters with 384 experts — the largest open MLA model to date.

3. The Hybrid Attention Camp (LLaMA 4, Gemma 3/4): These models attack the problem architecturally by mixing short-range local attention with sparse global attention. Meta's public LLaMA 4 Scout model card advertises a 10 million token context window. Gemma 4 uses KV sharing across layers plus hybrid head dimensions (256 for local, 512 for global).

[Charts: KV cache memory per token comparison (FP16); memory reduction vs. the FP16 MHA baseline (lower is better); visual summary of how each technique reduces KV cache.]

Stacking Optimizations

The real power comes from combining multiple techniques. A modern deployment might use:

GQA (4–8×) + Local-Global attention (3–5×) + TurboQuant (4–7×) + PagedAttention (eliminates waste) + SnapKV eviction (further reduction)

These compound: 8 × 4 × 5 = 160× theoretical reduction from the naive FP16 MHA baseline. No single production model uses all five simultaneously today, but the trend is clear: each new generation stacks more of these techniques together.

Chapter 16

Kimi K2.5 — How Reasoning Changes for AI Agents

Everything we've covered so far treats the model as a single entity: one prompt in, one response out. But the newest frontier in AI — agentic workflows — changes the game entirely. An AI agent doesn't just answer a question; it plans, breaks problems into steps, calls external tools (web search, code execution, APIs), reads the results, reasons about them, and repeats this cycle potentially hundreds of times. This creates a completely different stress profile for KV cache.

Moonshot AI's Kimi K2.5 (February 2026) is the first major open model specifically designed to handle this agentic paradigm at scale. Its technical report lays out how reasoning, memory, and KV cache management work differently when a model operates as an agent rather than a chatbot.

Kimi Team, Moonshot AI. "Kimi K2.5: Visual Agentic Intelligence." arXiv: 2602.02276, February 2026.

The Problem: Why Agents Break Traditional KV Cache

In a standard chatbot conversation, the context grows slowly — one user message, one model response, repeat. The KV cache grows predictably and stays within the context window.

An agent is nothing like this. Consider an agent researching a topic:

User task → plan steps → call tool #1 → read result (10K tokens) → call tool #2 → read result (8K tokens) → … ×200 more calls

After 200 tool calls, the context might contain millions of tokens — most of which are tool outputs that the model only needed briefly. The KV cache balloons, latency spikes, and eventually the model hits its context limit and breaks down.

Kimi K2.5's Three-Part Solution

1. Agent Swarm — Parallel Decomposition

Instead of a single agent chewing through a problem step-by-step (which creates one enormous, ever-growing KV cache), K2.5 introduces Agent Swarm: a framework where a central orchestrator breaks a complex task into independent sub-problems, then spawns multiple sub-agents to work on them simultaneously.

How Agent Swarm Manages KV Cache

Each sub-agent runs in its own isolated context window with its own KV cache. Instead of one monstrous 500K-token context, you get ten 50K-token contexts running in parallel. When a sub-agent finishes, only its final output (a few hundred tokens) gets sent back to the orchestrator — not its full reasoning chain or tool outputs. The sub-agent's KV cache is then freed entirely.

This is essentially PagedAttention philosophy applied at the agent level: allocate memory where needed, free it when done, and never let any single context grow unbounded.

The result: 4.5× latency reduction over single-agent baselines on complex web search tasks, and a 17.8% accuracy improvement on BrowseComp (60.6% → 78.4%).
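The cache lifecycle above can be sketched as a toy orchestrator (purely illustrative; the task names and token counts are made up, and Python threads stand in for parallel inference workers):

```python
# Toy sketch of the Agent Swarm cache lifecycle: each sub-agent owns an
# isolated context, returns only a short summary, and its cache is
# freed on exit. Not Moonshot's implementation.
from concurrent.futures import ThreadPoolExecutor

KV_KB_PER_TOKEN = 70  # MLA-style per-token cost used in the article

def run_subagent(task, tool_output_tokens):
    context = list(range(tool_output_tokens))   # stands in for the KV cache
    peak_kb = len(context) * KV_KB_PER_TOKEN
    summary = f"{task}: done ({len(context)} tokens read)"
    del context                                 # cache freed when the sub-agent exits
    return summary, peak_kb

tasks = [("search-A", 10_000), ("search-B", 8_000), ("search-C", 12_000)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda t: run_subagent(*t), tasks))

# The orchestrator only ever holds the short summaries, never the
# sub-agents' full contexts.
summaries = [s for s, _ in results]
peak_one_agent = max(kb for _, kb in results)   # worst single sub-agent cache
sequential_kb = sum(kb for _, kb in results)    # a single agent would accumulate all of it
print(peak_one_agent, sequential_kb)            # 840000 2100000
```

The resident memory at any moment is bounded by the largest live sub-agent cache, not by the sum of everything the swarm has ever read.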

2. PARL — Teaching the Model When to Parallelize

The orchestrator doesn't use hand-coded rules to decide when to spawn sub-agents. Instead, Moonshot trained it with a custom reinforcement learning method called Parallel-Agent RL (PARL). The reward function has three components:

PARL reward function: r_PARL = λ₁ · r_parallel + λ₂ · r_finish + r_perf(x, y)

  1. r_parallel (instantiation reward): encourages the model to actually spawn parallel agents instead of doing everything sequentially ("serial collapse").
  2. r_finish (completion reward): prevents spawning useless sub-agents that don't finish their tasks ("spurious parallelism").
  3. r_perf (task performance): the final answer quality — did the agent actually solve the user's problem?

A critical design choice: sub-agents are frozen (their weights don't update during PARL training). Only the orchestrator learns. This avoids a nasty problem in multi-agent RL where you can't tell which agent deserves credit for a good outcome.

3. Toggle — Budget-Aware Reasoning

Kimi K2.5 uses a training technique called Toggle that alternates between two modes:

  1. Budget-limited phase: forces the model to reason concisely, reducing output tokens by 25–30%. The model learns to skip unnecessary reasoning steps.
  2. Standard scaling phase: allows extended reasoning chains. For hard problems, the model can use 7K–36K reasoning tokens before acting.

Why this matters for KV cache: reasoning tokens are expensive. A "thinking" model like K2.5 might generate 36,000 internal reasoning tokens before producing a 200-token answer. Those 36K tokens all live in the KV cache. Toggle teaches the model to be frugal — use 7K tokens for easy sub-tasks, save the 36K budget for genuinely hard ones. At ~70 KB per token (MLA), that's the difference between 490 MB and 2.5 GB of KV cache just for the thinking phase.
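A quick check of that arithmetic (decimal GB, using the article's ~70 KB/token MLA figure):

```python
# KV cache consumed by the thinking phase alone, at ~70 KB per token.

def thinking_cache_gb(reasoning_tokens, kb_per_token=70):
    return reasoning_tokens * kb_per_token / 1e6   # decimal GB

print(thinking_cache_gb(7_000))    # 0.49
print(thinking_cache_gb(36_000))   # 2.52
```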

Discard-All Context Management

For extremely long agent sessions, K2.5 uses a strategy called "Discard-all": after each major reasoning phase, the model discards the full context (including tool outputs) and starts fresh with only the accumulated conclusions. This effectively resets the KV cache periodically rather than letting it grow to the context limit.

On the BrowseComp benchmark, using Discard-all boosted accuracy from 60.6% to 74.9% — beating GPT-5.2's reported 65.8%, Claude Opus 4.5 (37.0%), and Gemini 3 Pro (37.8%). The surprising part: throwing away context helped, because a bloated context full of stale tool outputs actually confuses the model more than it helps.

The KV Cache Implications

| Aspect | Standard Chat Mode | Kimi K2.5 Agent Mode |
|---|---|---|
| Context growth | Linear, predictable | Bursty (tool outputs spike it), then reset |
| Max KV cache per session | Context window × 70 KB | Bounded by sub-agent window (~50K tokens) |
| Parallelism | One cache per user | Multiple caches (orchestrator + N sub-agents) |
| Cache lifetime | Entire conversation | Sub-agent caches freed after each sub-task |
| Reasoning overhead | Minimal (short CoT) | 7K–36K reasoning tokens per step (Toggle-managed) |
| Tool calls per session | 0–5 typically | Hundreds across parallel sub-agents (paper reports up to 100 tool calls per sub-agent in BrowseComp) |
| Context resets | None | Periodic (Discard-all after each phase) |

Why This Matters for the Future of KV Cache

Kimi K2.5 demonstrates that as AI shifts from "chatbots" to "agents," the KV cache problem transforms from a simple memory-scaling challenge into a dynamic memory management problem. The solutions aren't just about compressing each token's cache (MLA, GQA, quantization) — they're about deciding which contexts to keep alive, when to reset them, and how to distribute work across parallel cache instances. Agent Swarm is essentially the agentic equivalent of PagedAttention: instead of paging KV blocks within a single context, it pages entire contexts across multiple parallel agents.

Chapter 17

The State of the Art (2026) & Future Directions

As of early 2026, the KV cache problem that once threatened to halt the scaling of large language models has been substantially addressed through a multi-pronged approach. Here's where things stand:

What's Working Today

GQA (industry standard): LLaMA-3/4, Qwen3, Gemma 4, Phi-4, Mistral.
MLA (best compression ratio): DeepSeek-V3, Kimi K2 (1T params).
Hybrid attention (extreme long context): LLaMA 4 Scout (10M), Gemma 4 E2B/E4B (128K mobile/edge).

The Emerging Frontier

1. Trillion-Parameter MLA: Moonshot AI's Kimi K2 pushed MLA to 1.04 trillion parameters with 384 experts — proving that DeepSeek's latent compression scales far beyond its original 236B deployment. With only 64 attention heads (vs DeepSeek-V3's 128), K2 squeezes even more efficiency out of MLA.

2. Ten-Million-Token Context: Meta's public LLaMA 4 Scout model card reports a 10 million token context window. That is roughly 5,000 pages of text in a single prompt, even though the public card does not spell out every attention/KV implementation detail behind it.

3. On-Device Intelligence: Gemma 4 E2B and E4B target mobile and edge deployments with 128K+ context. Their small KV budgets come from low KV-head counts, local-global attention, KV sharing across layers, and per-layer embeddings. Separately, Gemma 4 26B-A4B is the MoE family member that uses sparse routing (3.8B active of 26B); that MoE detail should not be attributed to the E2B/E4B variants. Post-training KV cache quantization methods (such as TurboQuant or KIVI) could further reduce memory, though Google's Gemma 4 documentation does not cite any specific KV quantization scheme as part of the on-device strategy.

4. Aggressive KV Head Reduction: Qwen3's MoE variant uses just 4 KV heads with 64 query heads — a 16:1 ratio. This trend toward extreme Q-to-KV ratios suggests the field is converging on the insight that you need very few KV heads, even for models with many query heads.

5. Data-Oblivious Quantization: TurboQuant requires no training data or model-specific tuning, making it a universal plug-in for any deployment. Combined with architectural improvements, effective KV cache reductions of 50–100× over the 2017 baseline are now routine.

The Journey in Numbers

| Metric | Naive FP16 MHA Baseline (e.g. LLaMA-2 7B, 2023) | 2026 State of the Art | Improvement |
|---|---|---|---|
| KV cache per token (7B-class model) | 512 KB (FP16 MHA) | ~10–20 KB (MLA + quant) | 25–50× |
| Max context on a single GPU | ~2K tokens | 1M+ tokens | 500× |
| Serving throughput | Baseline | 29× (with eviction) | 29× |
| Memory waste in serving | 60–80% | ~0% (PagedAttention) | Eliminated |
| Smallest device for 8B+ models | Data center GPUs | Smartphones (Bonsai 8B: 44 tok/s on iPhone) | Democratized |

The Big Picture

The story of KV cache optimization is a story of the AI community refusing to accept limitations. When the "memory wall" threatened to stop progress, researchers attacked the problem from every angle: architectural changes (MQA, GQA, MLA), memory management (PagedAttention), numerical compression (KIVI, TurboQuant), intelligent caching (H2O, SnapKV), and attention pattern design (sliding window, local-global). Together, these advances have reduced the effective KV cache cost by 50–100×, enabling language models to run on devices that fit in your pocket while handling documents longer than entire novels.

Visual Recap

The Full Journey — MHA to MLA

[Video: a fast, social-friendly recap and a slower full walkthrough of the progression from MHA to MLA; the original interactive canvas version accompanies the article for readers who want to explore the whole visual journey.]

Chapter 18

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
  2. Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv preprint, 2019. arXiv:1911.02150
  3. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. arXiv:2305.13245
  4. Touvron, H., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv preprint, 2023. arXiv:2307.09288
  5. Llama Team, AI @ Meta. "The Llama 3 Herd of Models." arXiv preprint, 2024. arXiv:2407.21783
  6. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv preprint, 2024. arXiv:2405.04434
  7. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv preprint, 2024. arXiv:2412.19437
  8. Jiang, A.Q., et al. "Mistral 7B." arXiv preprint, 2023. arXiv:2310.06825
  9. Gemma Team, Google. "Gemma: Open Models Based on Gemini Research and Technology." arXiv preprint, 2024. arXiv:2403.08295
  10. Gemma Team, Google. "Gemma 2: Improving Open Language Models at a Practical Size." arXiv preprint, 2024. arXiv:2408.00118
  11. Gemma Team, Google. "Gemma 3 Technical Report." arXiv preprint, 2025. arXiv:2503.19786
  12. Yang, A., et al. "Qwen2 Technical Report." arXiv preprint, 2024. arXiv:2407.10671
  13. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., and Stoica, I. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180
  14. Dao, T., Fu, D.Y., Ermon, S., Rudra, A., and Ré, C. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv:2205.14135
  15. Dao, T. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv preprint, 2023. arXiv:2307.08691
  16. Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." ICML 2024. arXiv:2402.02750
  17. Hooper, C., Kim, S., et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." NeurIPS 2024. arXiv:2401.18079
  18. Google Research. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." arXiv preprint, 2025. arXiv:2504.19874
  19. Zhang, Z., Sheng, Y., et al. "H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." NeurIPS 2023. arXiv:2306.14048
  20. Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. "Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time." NeurIPS 2023. arXiv:2305.17118
  21. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv:2309.17453
  22. Li, Y., et al. "SnapKV: LLM Knows What You are Looking for Before Generation." NeurIPS 2024. arXiv:2404.14469
  23. Yang, D., et al. "PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference." arXiv preprint, 2024. arXiv:2405.12532
  24. Cai, Z., et al. "PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling." arXiv preprint, 2024. arXiv:2406.02069
  25. Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Ragan-Kelley, J. "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention." arXiv preprint, 2024. arXiv:2405.12981
  26. Qwen Team. "Qwen3 Technical Report." arXiv preprint, 2025. arXiv:2505.09388
  27. Kimi Team, Moonshot AI. "Kimi K2: Open Agentic Intelligence." arXiv preprint, 2025. arXiv:2507.20534
  28. Meta AI. "The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation." Meta AI Blog, 2025. meta.ai
  29. Microsoft Research. "Phi-4 Technical Report." arXiv preprint, 2024. arXiv:2412.08905
  30. Microsoft Research. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint, 2025. arXiv:2503.01743
  31. Kimi Team, Moonshot AI. "Kimi K2.5: Visual Agentic Intelligence." arXiv preprint, 2026. arXiv:2602.02276
  32. Touvron, H., Lavril, T., Izacard, G., et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint, 2023. arXiv:2302.13971
  33. Penedo, G., et al. "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." arXiv preprint, 2023. arXiv:2306.01116
  34. Qwen Team. "Qwen3.5 Model Series." Model card, 2025. HuggingFace: Qwen3.5-9B
  35. PrismML / Caltech. "1-bit Bonsai: The First Commercially Viable 1-bit LLMs." Whitepaper, 2025. GitHub