MHA burned memory. MQA burned quality. Here's what actually worked.
By Pradip Tivhale
June 2017. Eight researchers at Google drop a paper with one of the boldest titles in computer science history: "Attention Is All You Need." Inside was a new architecture they called the Transformer — and it would go on to power virtually every major AI system built in the decade that followed.
To appreciate why the Transformer mattered, you need to understand what came before it. The reigning champion of text processing was the Recurrent Neural Network (RNN). Imagine a relay race where each runner must memorize and recite every message carried by all previous runners before passing the baton. Runner #50 has to recite 49 accumulated messages, then add their own. Inevitably, early messages get garbled or lost. Worse, the race is strictly sequential: runner #10 can't start until runner #9 finishes, no matter how many spare runners you have standing around. That's the RNN — slow, forgetful, and impossible to speed up by adding more hardware.
The Transformer discarded that relay race entirely. Its breakthrough? Attention. Instead of passing messages down a chain one at a time, it spreads the entire text on a giant table and lets every word look at every other word simultaneously. Think of a room of 500 people at a networking event where everyone can hear every conversation happening at once and instantly decide which ones are relevant to them. No chain. No waiting. Massively parallel.
The numbers spoke for themselves. On the WMT 2014 English-to-German translation benchmark, the Transformer hit a BLEU score of 28.4 — blowing past the previous best by over 2 points. In machine translation, that's not an incremental gain; it's a generational leap. On top of that, training was much faster because the architecture could chew through all tokens in parallel rather than one at a time.
What followed was an explosion. BERT. GPT-2. GPT-3. GPT-4. LLaMA. Gemma. Qwen. DeepSeek. Mistral. Hundreds of models, all built on the same Transformer backbone. But as these models ballooned from millions to hundreds of billions of parameters — and as users demanded they handle longer and longer documents — a hidden cost quietly grew in the background. A cost that would eventually consume more GPU memory than the model weights themselves. That cost has a name: the KV Cache.
To understand the KV Cache, we first need to understand how attention works. The Transformer uses a mechanism called Scaled Dot-Product Attention, and it relies on three things: Queries (Q), Keys (K), and Values (V).
Here is an analogy that makes it concrete. Imagine you are ordering at a massive food court with 200 stalls: your craving is the Query, each stall's menu board is a Key, and each stall's dish is a Value.
In the Transformer, each word (or "token") generates three things: a Query, a Key, and a Value — all represented as lists of numbers (vectors). The process works like the food court: your craving (Query) gets compared against every stall's menu board (Keys) to produce a relevance score. High-scoring stalls contribute more of their dish (Value) to your final plate. Low-scoring ones contribute almost nothing. The result is a weighted blend of Values, customized to what this particular token was "looking for."
In plain English: "Scan your craving (Q) against every stall's menu board (K), calculate a match percentage for each (softmax turns raw scores into percentages that add up to 100%), then blend the dishes (V) in proportion to how well they matched."
The full formula is Attention(Q, K, V) = softmax(QKᵀ / √dk) · V. The √dk part is just a scaling factor to keep the dot products from getting too large; it's the square root of the dimension of the key vectors.
Each Q, K, and V vector has a specific size called the head dimension (dh), typically 128 numbers. In a model like LLaMA-2 7B, there are 32 attention heads, each with dimension 128, giving a total hidden size of 4,096.
During training, the model processes all tokens at once, so Q, K, and V are all computed together. But during inference (when the model generates text one token at a time), something important happens: the model generates one new token, but it needs to attend to ALL previous tokens. This means it needs the Keys and Values of every previous token — and that's where the KV Cache enters the picture.
Here's how a language model actually generates text. It doesn't write a whole sentence at once — it produces one token (roughly one word) at a time. Suppose the model has already written "The cat sat on the" and now needs to pick the next word. To make that decision, it has to run the attention mechanism, which means comparing the new token against every single previous token.
Now here's the wasteful part. Without caching, the model recomputes the Key and Value vectors for "The," "cat," "sat," "on," and "the" from scratch every single time it generates a new word. At token 1,000? It recomputes K and V for all 999 previous tokens. At token 1,001? All 1,000. You can see how this snowballs into an absurd amount of redundant work.
The KV Cache is the fix. It simply stores the Key and Value vectors once they're computed and reuses them for all future tokens. Each new token only needs to compute its own Q, K, V, then look up everything else from the cache. No redundant recomputation. Simple and effective.
Toy scaled-dot-product example: one 1×4 Query is compared against five cached Keys, scaled by √dk, then turned into attention weights and a weighted V output.
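The same toy example fits in a few lines of NumPy. This is an illustrative sketch with made-up numbers, not code from any real model:

```python
import numpy as np

d_k = 4                                    # head dimension of the toy example
q = np.array([[1.0, 0.5, -0.2, 0.3]])     # 1x4 Query for the new token

rng = np.random.default_rng(0)
K = rng.normal(size=(5, d_k))              # five cached Key vectors
V = rng.normal(size=(5, d_k))              # five cached Value vectors

scores = q @ K.T / np.sqrt(d_k)            # 1x5 raw relevance scores
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> percentages
output = weights @ V                       # weighted blend of the Values

print(weights.round(2))    # five percentages that sum to 1.0
print(output.shape)        # (1, 4)
```

The five weights are exactly the "match percentages" from the food-court analogy; the output is the customized plate.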
Without the KV Cache, generating a sequence of length n costs O(n²) total work just recomputing Keys and Values, because every step redoes the projections for all previous tokens. With the cache, each token's K and V are computed exactly once, and each new step only computes its own Q, K, V and attends to the cached entries — a single query against stored keys and values instead of a full recomputation.
Here's the catch: those cached K and V vectors consume GPU memory, and they grow linearly with sequence length. In FP16 (16-bit floating point, 2 bytes per number), the per-token cost is:

KV cache per token = 2 × layers × kv_heads × head_dim × 2 bytes

The first "2" is for Keys and Values. The final "2 bytes" is for FP16 precision. Multiply by the number of cached tokens to get the total. Let's see what this means for real models:
| Model | Layers | KV Heads | Head Dim | KV Cache / Token (FP16) | KV Cache @ 4K Tokens | KV Cache @ 128K Tokens |
|---|---|---|---|---|---|---|
| LLaMA-1 65B MHA | 80 | 64 | 128 | 2,560 KB | 10 GB | 320 GB |
| Falcon-7B MQA | 32 | 1 | 64 | 8 KB | 0.03 GB | 1.0 GB |
| Mistral 7B GQA+SWA | 32 | 8 | 128 | 128 KB | 0.5 GB | 16 GB |
| Qwen3 235B GQA-4 | 94 | 4 | 128 | 189 KB | 0.76 GB | 24 GB |
| DeepSeek-V3 MLA | 61 | Latent (dc=512) | 576* | 69 KB | 0.27 GB | 8.6 GB |
| Gemma 4 31B GQA+LG | 60 | 16 (all layers) | 256 local / 512 global | ~1,120 KB naive | ~2.0 GB | ~41 GB† |
*MLA stores a compressed latent of 576 dims (dc=512 + dR=64) instead of separate K and V vectors. Per-token formula: layers × 576 × 2 bytes.

†Gemma 4 31B: 60 layers (50 sliding, 10 global), all with 16 KV heads. Sliding layers use head_dim=256 and are capped at 1,024 tokens. Global layers use head_dim=512 with full context. Cache growth is sub-linear: ~1.1 GB at 1K but ~41 GB at 128K, because the global layers dominate at long contexts. Source: official config.json.
For LLaMA-1 65B (pure MHA) at 128K context, the KV cache requires 320 GB — while the model weights in FP16 are about 130 GB. The cache is 2.5x the model. Even for a 7B model with MHA, the cache hits 64 GB at 128K context while the model is only 14 GB (4.5x ratio). This is the "memory wall" that forced the industry to rethink attention.
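The per-token figures in the table follow directly from the formula. A quick sketch, using the layer and head counts listed above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per=2):
    """2 (K and V) x layers x KV heads x head dim x bytes, times tokens."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per * n_tokens

# LLaMA-1 65B (pure MHA): 80 layers, 64 KV heads, head_dim 128
print(kv_cache_bytes(80, 64, 128, 1) // 1024)         # 2560 KB per token
print(kv_cache_bytes(80, 64, 128, 131_072) // 2**30)  # 320 GB at 128K tokens

# Falcon-7B (MQA): 32 layers, 1 KV head, head_dim 64
print(kv_cache_bytes(32, 1, 64, 1))                   # 8192 bytes = 8 KB
```

Swapping in any other model's layer count, KV-head count, and head dimension reproduces the remaining rows.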
Illustrative LLaMA-2 7B FP16-style baseline: each new token adds one more cached K/V state for every layer. Watch GPU memory fill up over time.
The KV Cache creates three critical problems that limit how we can deploy and use large language models:
An NVIDIA A100 GPU can perform 312 trillion floating-point operations per second (312 TFLOPS). That's a lot of raw math power. But it can only move data from its memory at about 2 terabytes per second. When a model generates text one token at a time, the actual math is trivial — multiply one query vector against a bunch of cached keys. The bottleneck isn't doing the multiplication; it's loading all those keys and values from memory fast enough. Engineers call this being memory-bandwidth bound.
Here's an analogy that makes this click. Imagine a pizza oven that can bake 1,000 pizzas per minute (the GPU's compute). Sounds amazing, right? Except the kitchen only has one narrow doorway for ingredients to come through (memory bandwidth). The oven sits idle most of the time, waiting for dough and toppings to arrive. Making the oven bigger doesn't help. You need either a wider doorway — or smaller pizzas.
There's another subtle issue. When serving multiple users simultaneously, each request may need a different amount of KV cache (because they have different conversation lengths). Traditional systems pre-allocate a fixed-size memory block for each request, leading to massive waste when the actual conversation is shorter than the maximum. The PagedAttention paper showed that existing systems waste 60-80% of KV cache memory due to fragmentation.
These problems inspired a decade of research that we'll explore in the following chapters. The solutions fall into several categories:
1. Fewer KV heads: MQA, GQA, MLA — reduce how many separate Key-Value pairs you store
2. Smarter memory management: PagedAttention — eliminate waste in how memory is allocated
3. Smaller numbers: KIVI, KVQuant, TurboQuant — use fewer bits per number
4. Fewer tokens cached: H2O, SnapKV, StreamingLLM — don't cache everything
5. Shorter attention span: Sliding window, local-global — limit how far back the model looks
The original Transformer uses Multi-Head Attention (MHA). In MHA, the model runs multiple parallel "attention heads" — each one learns to focus on different types of word relationships. One head might specialize in tracking who did what to whom (subject-object). Another might track temporal cues ("before," "after," "meanwhile"). A third might notice negation patterns. They each develop their own specialty without being explicitly told what to learn.
In MHA, every attention head has its own separate set of Keys, Values, and Queries. For a model with 32 attention heads:
For LLaMA-2 7B: 2 × 32 layers × 32 heads × 128 dim × 2 bytes = 524,288 bytes = 512 KB per token.
MHA provides the best quality because each head maintains its own representation of Keys and Values. However, the KV cache grows proportionally to the number of heads, making it the most memory-expensive approach.
A common source of confusion: the "KV" in "KV Cache" does not mean a single combined vector. K (Key) and V (Value) are two completely independent vectors, each of size head_dim (typically 128 numbers). They are computed by separate weight matrices (WK and WV), stored separately in memory, and used at different stages of the attention computation. K is used to calculate relevance scores (multiplied with Q), while V provides the actual content that gets weighted and summed. The "2" in the KV cache formula 2 × layers × heads × dim accounts for these being two distinct stored vectors per head.
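A minimal sketch of that separation, with random stand-ins for the learned weight matrices (LLaMA-2 7B shapes):

```python
import numpy as np

d_model, n_heads, d_head = 4096, 32, 128
rng = np.random.default_rng(0)

W_K = rng.normal(size=(d_model, n_heads * d_head)) * 0.02  # separate matrices,
W_V = rng.normal(size=(d_model, n_heads * d_head)) * 0.02  # learned separately

x = rng.normal(size=(1, d_model))          # hidden state of one new token
k = (x @ W_K).reshape(n_heads, d_head)     # 32 Key vectors, one per head
v = (x @ W_V).reshape(n_heads, d_head)     # 32 Value vectors, one per head

# Both get cached: 2 vectors x 32 heads x 128 dims x 2 bytes (FP16)
per_layer = 2 * n_heads * d_head * 2
print(per_layer)           # 16384 bytes per layer
print(per_layer * 32)      # 524288 bytes = 512 KB per token over 32 layers
```

The final line reproduces the 512 KB-per-token figure computed above.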
November 2019. Noam Shazeer — one of the eight original Transformer authors — publishes a short, almost casual paper with an idea so simple it seems like it shouldn't work: make all the attention heads share a single set of Keys and Values.
That's Multi-Query Attention (MQA) in one sentence. You still have 32 query heads asking 32 different questions about the text. But instead of each head maintaining its own personal reference sheet (its own Keys and Values), they all consult the exact same sheet. Imagine 32 food critics at a restaurant — each one evaluates the meal from a different angle (texture, aroma, plating, spice level), but they're all tasting from the same set of plates. One kitchen. Thirty-two opinions.
For a model with 32 query heads but only 1 KV head, the KV cache shrinks by 32× compared to MHA!
Shazeer demonstrated that since autoregressive decoding is bottlenecked by memory bandwidth (not compute), shrinking the KV cache directly speeds up token generation. In practice, models using MQA generated tokens significantly faster, while benchmark scores dropped only slightly compared to the full MHA setup.
While MQA provides the maximum memory savings, forcing all heads to share a single KV representation can hurt model quality, especially for complex reasoning tasks. This quality concern led to the development of Grouped-Query Attention (next chapter).
MQA was adopted by several notable models including PaLM (Google, 2022), Falcon-7B (TII, 2023), and StarCoder (BigCode, 2023). Note: Falcon-40B and Falcon-180B use GQA (8 KV heads), not MQA — only the 7B variant uses true single-head MQA. Concerns about quality degradation led most subsequent models to prefer GQA instead.
By mid-2023, the AI community faced a dilemma: MHA gave the best quality but was memory-hungry, while MQA saved memory but sacrificed accuracy. Joshua Ainslie and his team at Google asked the natural question: what if we do something in between?
Their answer was Grouped-Query Attention (GQA). Rather than giving each query head its own private KV pair (MHA) or forcing all query heads to share one KV pair (MQA), GQA organizes query heads into small groups. Each group gets its own dedicated KV pair. For instance, with 32 query heads split into 8 groups, every 4 query heads share one KV head.
The paper made two important contributions:
1. Uptraining Recipe: Existing MHA models can be converted to GQA using only 5% of original pre-training compute. This means you don't need to retrain from scratch.
2. Quality-Speed Trade-off: GQA with 8 KV heads achieves quality close to MHA while maintaining speed comparable to MQA. It's the Goldilocks solution.
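A NumPy sketch of how the grouping works in practice — the `repeat` step mirrors the `repeat_kv` helper found in common open-source implementations; the shapes are illustrative:

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 32, 8, 128, 10
group = n_q_heads // n_kv_heads            # 4 query heads per KV head

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, 1, d_head))      # the new token's 32 queries
K = rng.normal(size=(n_kv_heads, seq, d_head))   # cache holds only 8 K heads
V = rng.normal(size=(n_kv_heads, seq, d_head))   # ...and only 8 V heads

K_exp = np.repeat(K, group, axis=0)        # each KV head serves its group of 4
V_exp = np.repeat(V, group, axis=0)

scores = Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores)
w /= w.sum(axis=-1, keepdims=True)
out = w @ V_exp

# MQA is the n_kv_heads = 1 case; MHA is n_kv_heads = n_q_heads.
print(out.shape)           # (32, 1, 128)
```

Only the 8-head K and V tensors are ever cached; the expansion happens at compute time and costs no extra memory bandwidth for the cache itself.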
GQA quickly became the de facto standard for modern large language models. The following table shows its widespread adoption:
| Model | Year | Attention Type | Query Heads | KV Heads | KV Reduction vs MHA |
|---|---|---|---|---|---|
| LLaMA-2 7B/13B | 2023 | MHA | 32 / 40 | 32 / 40 | 1× (baseline) |
| LLaMA-2 70B | 2023 | GQA | 64 | 8 | 8× |
| LLaMA-3 8B/70B/405B | 2024 | GQA | 32/64/128 | 8 | 4× / 8× / 16× |
| Mistral 7B | 2023 | GQA | 32 | 8 | 4× |
| Qwen-2 | 2024 | GQA | varies | varies | 4–8× |
| Gemma 2 | 2024 | GQA | varies | varies | 4–8× |
May 2024. While the rest of the industry was settling into GQA as "good enough," a Chinese AI lab called DeepSeek dropped a paper that made everyone do a double take. Their approach didn't just tweak the number of KV heads — it threw out the idea of storing Keys and Values at all. Instead, DeepSeek compressed the full K and V information for each token into a tiny "latent" vector, then reconstructed the actual Keys and Values on-the-fly during inference from this compressed seed.
Think of it in terms of music. MHA is like storing full uncompressed WAV recordings of every instrument in an orchestra — pristine audio, but terabytes of disk space. GQA is like grouping instruments into sections (strings, brass, woodwinds) and keeping one recording per section. MLA goes further still: it doesn't store any audio at all. Instead, it stores a tiny MIDI-like encoding — just the notes, velocities, and timing — from which a synthesizer can reconstruct a faithful rendition of the full orchestra on demand. That MIDI encoding is the "latent vector," and the synthesizer is a learned matrix multiplication built into the model weights.
Mathematically, instead of storing the full Key matrix K (size: n_heads × d_head) and Value matrix V (size: n_heads × d_head) for each token, MLA stores a single compressed vector ct of dimension dc, where dc is much smaller than the full KV size.
In DeepSeek-V2, the specific parameters are: a latent (compressed KV) dimension of dc = 512, plus a small decoupled 64-dimensional RoPE key (dhR = 64), shared across all 128 query heads of dimension dh = 128.
The KV cache per token requires only (dc + dhR) = 576 elements, which is equivalent to GQA with only 2.25 KV groups. For comparison, standard MHA with 128 heads would need 2 × 128 × 128 = 32,768 elements per layer.
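A schematic NumPy sketch of the compress-then-reconstruct idea, with random placeholder matrices; the real MLA also carries the separate 64-dim decoupled RoPE key, which this sketch omits:

```python
import numpy as np

d_model, n_heads, d_head, d_c = 5120, 128, 128, 512
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_c)) * 0.02            # compression
W_up_k = rng.normal(size=(d_c, n_heads * d_head)) * 0.02   # K reconstruction
W_up_v = rng.normal(size=(d_c, n_heads * d_head)) * 0.02   # V reconstruction

x = rng.normal(size=(1, d_model))
c_t = x @ W_down                  # ONLY this latent is cached per token

# At attention time, all 128 heads' K and V are rebuilt on the fly.
K = (c_t @ W_up_k).reshape(n_heads, d_head)
V = (c_t @ W_up_v).reshape(n_heads, d_head)

print(c_t.size)                   # 512 cached elements (576 with RoPE part)
print(K.size + V.size)            # 32768 elements reconstructed on demand
```

The cache stores 512 numbers where MHA would store 32,768 — the ratio the text computes above. In the deployed model, the up-projections are folded into the attention weights so reconstruction costs little.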
In December 2024, DeepSeek released V3, which continued using MLA and achieved even more impressive results:
| Model | Attention Type | Total Params | KV Cache / Token | vs. LLaMA-3 405B |
|---|---|---|---|---|
| LLaMA-3.1 405B | GQA (8 heads) | 405B | 1,008 KB | 1× (baseline) |
| Qwen-2.5 72B | GQA | 72B | 640 KB | 0.63× |
| DeepSeek-V3 | MLA | 671B (37B active) | 69 KB | 0.07× (≈15× smaller) |
MLA achieves better quality than MHA (not just comparable) while using far less KV cache memory. The DeepSeek-V2 paper showed MLA actually outperforms MHA on benchmarks, possibly because the compression acts as regularization. This killed the longstanding assumption that reducing KV cache always costs you quality.
The animation below runs all four attention methods side by side. Watch tokens arrive one at a time and see how much cached attention state each method stores. This is a normalized 32-layer FP16 comparison so the methods are compared on the same footing.
All the methods discussed so far (MHA, MQA, GQA, MLA) reduce KV cache along the "head" dimension — fewer or compressed heads. But what about another dimension: layers?
In May 2024, researchers at MIT proposed Cross-Layer Attention (CLA): instead of each transformer layer maintaining its own KV cache, adjacent layers can share their Key-Value pairs.
The idea is simple: divide the transformer layers into groups (e.g., pairs of adjacent layers), and within each group, all layers use the Keys and Values from the bottom layer of the group. This means you only need to store KV cache for half the layers.
CLA can reduce the KV cache size by 2× while maintaining nearly the same accuracy as standard MQA. Importantly, CLA is orthogonal to head-level compression — it can be combined with GQA or MQA for even greater savings. For example, GQA + CLA could achieve a combined 8–16× reduction over MHA.
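The sharing pattern itself is tiny; here is a toy sketch with groups of two layers, each layer reading KV from the bottom layer of its group:

```python
n_layers, group = 8, 2

# For each layer, the index of the layer whose KV cache it reads.
kv_source = [(layer // group) * group for layer in range(n_layers)]
print(kv_source)              # [0, 0, 2, 2, 4, 4, 6, 6]

# Only the source layers need a cache: half of them -> 2x reduction.
print(len(set(kv_source)))    # 4
```

Setting `group` larger trades more cache savings against more accuracy risk, exactly the same dial GQA turns along the head dimension.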
Another approach to reducing KV cache is limiting how far back the model looks. Instead of attending to every previous token (potentially hundreds of thousands), the model only looks at a fixed window of recent tokens.
In October 2023, Mistral AI published the Mistral 7B model, which introduced Sliding Window Attention (SWA) with a window size of 4,096 tokens. Each layer only attends to the previous 4,096 tokens, meaning the KV cache per layer is capped at a fixed size regardless of the total sequence length.
The key insight is that information can still flow across the entire sequence through stacked layers. With a window of W = 4,096 and k layers stacked, the effective attention span is W × k. For Mistral 7B with 32 layers, this gives a theoretical span of approximately 131,000 tokens even though each layer only looks at 4,096.
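In practice the window is implemented as a rolling buffer. A toy sketch of the capped cache and the span arithmetic:

```python
from collections import deque

W, n_layers = 4096, 32
cache = deque(maxlen=W)           # rolling KV buffer for one layer

for token_id in range(100_000):   # stream in 100K tokens
    cache.append(token_id)        # stand-in for the token's (K, V) pair

print(len(cache))                 # 4096 -- capped, however long the input
print(W * n_layers)               # 131072 -- theoretical span across layers
```

Old entries fall off the front automatically, so per-layer cache memory is constant no matter how long the conversation runs.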
Google's Gemma models took a more nuanced approach by interleaving two types of attention layers:
| Model | Local Window | Global Layers | Local:Global Ratio | KV Cache Benefit |
|---|---|---|---|---|
| Gemma 2 | 4,096 tokens | Every other layer | 1:1 | ~50% of full attention cache |
| Gemma 3 | 1,024 tokens | Every 6th layer | 5:1 | Major reduction for 128K context |
| Gemma 4 (31B / E4B / E2B) | 1024 (31B) / 512 (E4B, E2B) | Every 6th layer | 5:1 | 128K+ context; E4B shares KV across 18 layers; 31B does not share (num_kv_shared_layers=0) |
Local layers use sliding window attention: they only look at nearby tokens, so their KV cache is small and fixed. Global layers use full attention: they look at all tokens, requiring a larger KV cache. By making most layers local (5:1 ratio in Gemma 3), the total KV cache is drastically smaller.
By using a 5:1 local-to-global ratio with only 1,024-token local windows, Gemma 3 can handle 128K token contexts on accessible hardware. Only 1 out of every 6 layers needs to store full-length KV cache. The other 5 layers only store 1,024 tokens each — a fraction of the full sequence. This allows Gemma 3 27B to fit on a single GPU with 128K context.
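The arithmetic behind that claim, sketched with a hypothetical 60-layer stack (the helper and the layer count are illustrative, not Gemma's exact config):

```python
def cached_tokens(n_layers, local_per_global, window, context):
    """Total cached token slots across layers for a local:global interleave."""
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    return n_local * min(window, context) + n_global * context

full = cached_tokens(60, 0, 0, 131_072)       # all layers global (full attn)
mixed = cached_tokens(60, 5, 1024, 131_072)   # 5:1 local:global, 1K windows

print(full, mixed)                 # 7864320 vs 1361920 cached token slots
print(round(full / mixed, 1))      # ~5.8x fewer slots at 128K context
```

At long contexts the ten global layers dominate the total, which is why shrinking the local window further yields diminishing returns.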
Everything we've discussed so far makes the KV cache theoretically smaller. But there's a different, more mundane problem: even when the cache should fit in GPU memory, the software managing that memory does an awful job. Chunks get allocated, freed, re-allocated in different sizes — and before long, the GPU's memory looks like a parking lot full of oddly-spaced cars with no room for a new one.
In September 2023, Woosuk Kwon and colleagues at UC Berkeley published a fix that was hiding in plain sight for decades — in your computer's operating system, of all places. The idea: treat GPU memory for KV cache the same way Linux treats RAM. Break it into fixed-size pages. Don't require them to be contiguous. Map them with a lightweight table. They called it PagedAttention.
In traditional LLM serving, the system pre-allocates a contiguous block of memory for each request's KV cache based on the maximum possible sequence length. This leads to three types of waste: reserved slots the request never uses because it finishes early, internal fragmentation inside each oversized allocation, and external fragmentation in the gaps between differently sized allocations.
The paper found that existing systems waste 60–80% of KV cache memory!
PagedAttention divides the KV cache into fixed-size blocks (like pages in virtual memory). These blocks don't need to be contiguous in physical GPU memory — they can be scattered anywhere, connected by a simple lookup table. This eliminates fragmentation almost entirely.
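A toy sketch of the bookkeeping: a block table maps a request's logical cache positions to scattered physical blocks (block size 16 tokens, matching vLLM's default):

```python
BLOCK = 16
free_blocks = list(range(100))    # pool of physical KV blocks on the GPU
block_table = []                  # one request's logical -> physical map

def slot_for(pos):
    if pos % BLOCK == 0:                       # previous block just filled:
        block_table.append(free_blocks.pop())  # grab ANY free physical block
    return block_table[pos // BLOCK], pos % BLOCK

for pos in range(40):             # cache 40 tokens, one at a time
    physical_block, offset = slot_for(pos)

print(block_table)                # [99, 98, 97] -- need not be contiguous
print(len(block_table))           # 3 blocks for 40 tokens; no pre-allocation
```

Memory is claimed one block at a time as the sequence grows, so the only waste is the unused tail of the final block — at most 15 tokens per request here.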
The resulting system, called vLLM, posted clear gains: 2–4× higher serving throughput than the state-of-the-art systems of the time (FasterTransformer and Orca), with near-zero wasted cache memory.
vLLM became one of the most widely adopted open-source LLM serving frameworks. PagedAttention's approach to KV cache memory management has been adopted by several other serving systems including TensorRT-LLM and SGLang.
Another powerful strategy: instead of caching Keys and Values for every token, only keep the ones that actually matter. Research has shown that attention is highly concentrated — a small fraction of tokens receive the vast majority of attention.
Zhang et al. analyzed how attention is distributed across tokens and found a clear pattern: the vast majority of the total attention score is concentrated on a tiny fraction of tokens — what they call "heavy hitters." Most tokens contribute almost nothing to the final output. Based on this observation, their H2O system retains only these high-impact tokens alongside a window of recent tokens in the KV cache, discarding everything else.
With only 20% heavy hitters retained, H2O improved throughput by up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate on OPT-6.7B and OPT-30B models.
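The eviction policy itself is simple to sketch. Here the accumulated-attention scores are random stand-ins for the running totals H2O maintains:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_heavy, n_recent = 100, 10, 10

# How much attention each cached token has received so far (stand-in data).
acc_attention = rng.exponential(size=n_tokens)

heavy = set(np.argsort(acc_attention)[-n_heavy:].tolist())   # heavy hitters
recent = set(range(n_tokens - n_recent, n_tokens))           # recency window
kept = sorted(heavy | recent)          # everything else is evicted

print(len(kept))      # at most 20 of 100 tokens stay cached
```

The budget stays fixed as generation proceeds: each step, the lowest-scoring non-recent token is evicted to make room for the newest one.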
Liu et al.'s Scissorhands system builds on the "Persistence of Importance" hypothesis: tokens that were important at one generation step will continue to be important in future steps. This insight allows the system to identify important tokens early and keep only those.
Scissorhands achieves up to 5× KV cache reduction without quality loss, and when combined with 4-bit quantization, reaches 20× compression.
Xiao et al. at MIT uncovered a curious behavior in transformer models: no matter what text you feed them, the very first few tokens in the sequence always attract unusually high attention scores — even when those tokens carry no meaningful content (like a period or a padding symbol). The researchers named these positions "attention sinks." By preserving just these initial sink tokens alongside a sliding window of the most recent tokens, the model can process infinitely long sequences with a fixed memory budget.
StreamingLLM enabled models like Llama-2, MPT, and Falcon to process up to 4 million tokens with stable performance, achieving 22.2× speedup over sliding window recomputation baselines.
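The retention rule is a one-liner. This sketch uses 4 sink tokens (the count used in the paper) and an arbitrary 1,020-token recency window:

```python
def streaming_keep(positions, n_sinks=4, window=1020):
    """Keep the first n_sinks tokens plus the most recent `window` tokens."""
    sinks = positions[:n_sinks]
    recent = [p for p in positions[-window:] if p not in sinks]
    return sinks + recent

kept = streaming_keep(list(range(1_000_000)))   # a million-token stream

print(len(kept))            # 1024 -- fixed budget, no matter the length
print(kept[:5], kept[-1])   # [0, 1, 2, 3, 998980] 999999
```

Dropping the four sinks collapses generation quality, while keeping them stabilizes it — that asymmetry is the paper's core finding.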
Li et al. noticed something surprising: even before the model starts generating new tokens, the pattern of which prompt tokens each attention head focuses on is already stable and predictable. By examining attention weights in a small window near the end of the prompt, SnapKV can anticipate which earlier tokens will matter most and pre-select them for caching, discarding the rest.
Two complementary papers discovered that the number of important tokens decreases across layers. Lower layers need more cached tokens, while higher layers can function with very few. This pyramidal pattern allows allocating more cache budget to lower layers and less to higher layers.
PyramidInfer achieved 2.2× throughput improvement with over 54% GPU memory reduction. PyramidKV matched full KV cache performance while retaining only 12% of the cache.
Every approach so far has tried to cache fewer things. Quantization takes a completely different angle: cache the same things, but represent each number using fewer bits. It's the difference between storing fewer photos versus compressing each photo. In standard deployments, each number in the KV cache occupies 16 bits (FP16). The question quantization researchers ask: can we get away with 8, 4, or even 2 bits?
Think about how you describe someone's height. With FP16 precision, you'd say "175.38 cm" — exact to the fraction of a millimeter. With INT8, you'd round to "175 cm." With INT4, maybe "tall" or "short" with a few gradations. With INT2, you get just four buckets: "very short," "short," "tall," "very tall." Each step saves storage space but throws away some nuance. The gamble with KV cache quantization: can the model still reason accurately when its cached memories are stored as rough sketches instead of precise measurements?
Illustrative scalar view of KV quantization: the values are rounded more aggressively as bit width drops. Real systems such as KIVI and TurboQuant use more sophisticated schemes than this simple bar demo.
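The bar demo boils down to uniform rounding. A minimal symmetric quantizer makes the trade-off visible (illustrative values only; real KV quantizers add zero-points, grouping, and outlier handling):

```python
import numpy as np

def quantize_roundtrip(x, bits):
    """Round onto a symmetric grid of 2^(bits-1)-1 levels, then reconstruct."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

x = np.array([0.81, -0.42, 0.13, -0.97, 0.55])
for bits in (8, 4, 2):
    err = np.abs(quantize_roundtrip(x, bits) - x).max()
    print(bits, float(err))    # worst-case error grows as bits shrink
```

At 8 bits the reconstruction is nearly exact; at 2 bits every value snaps to one of three levels, which is why naive low-bit quantization needs the smarter schemes below.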
Liu et al. stumbled onto something that previous quantization work had missed. When you look at the raw numbers inside Keys and Values, they have different shapes of outliers. The Key matrix has certain channels (columns) that consistently contain huge values across all tokens — so you need to calibrate each channel separately. The Value matrix is the opposite: certain tokens (rows) spike across all channels. KIVI exploits this asymmetry by using per-channel quantization for Keys and per-token quantization for Values, squeezing both down to just 2 bits per number — a raw 8× compression from FP16.
Across LLaMA, Falcon, and Mistral model families, KIVI preserved benchmark accuracy with negligible degradation — all without any retraining or fine-tuning step.
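A schematic of KIVI's asymmetric treatment: the same uniform quantizer, scaled along different axes. The matrices and the 4-bit width here are illustrative; real KIVI works at 2 bits with group-wise zero-points and a small full-precision residual window:

```python
import numpy as np

def quant_axis(x, bits, axis):
    """Uniform symmetric quantization with one scale per row or per column."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))      # rows = tokens, columns = channels
K[:, 3] += 40.0                   # one consistently huge CHANNEL, as in Keys

per_channel = np.abs(quant_axis(K, 4, axis=0) - K).mean()  # KIVI's choice
per_token   = np.abs(quant_axis(K, 4, axis=1) - K).mean()  # the wrong axis

print(per_channel < per_token)    # True: matching the scale axis to the
                                  # outlier structure keeps the error low
```

Values show the mirror pattern (outlier tokens rather than outlier channels), so KIVI flips the axis there: per-token scales for V, per-channel scales for K.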
Hooper et al. pushed quantization further with several innovations: per-channel key quantization, pre-RoPE key quantization, non-uniform quantization, and per-vector dense-and-sparse quantization.
The results enabled: serving LLaMA-7B with a 1-million-token context on a single A100-80GB GPU, and up to 10 million tokens on an 8-GPU system, with less than 0.1 perplexity degradation at 3-bit precision.
Google's TurboQuant represents the cutting edge of KV cache quantization. Unlike previous methods that required careful calibration data or model-specific tuning, TurboQuant is completely data-oblivious — it works on any model without seeing any data.
The method quantizes Keys and Values on the fly to an effective 2.5–3.5 bits with near-zero measured quality loss, and because there is no calibration pass, it can be dropped into any serving stack without per-model tuning.
All the quantization methods above shrink the KV cache. But what happens when you also shrink the model weights to the extreme? In mid-2025, Caltech spinoff PrismML released Bonsai 8B — a language model where every single weight is represented as just one bit: either +1 or −1. An 8.2 billion parameter model that fits in 1.15 GB (compared to ~16 GB for a standard FP16 8B model). That's a 14× size reduction for the weights alone.
But here's the catch that relates to our KV cache story: 1-bit weights do NOT mean 1-bit KV cache. The weights (WK, WV, WQ, etc.) are 1-bit, but when they multiply against the input activations, the resulting Key and Value vectors are still full-precision floating point numbers. So even though Bonsai's weights are tiny, the KV cache during inference remains FP16-sized and grows linearly with context length — exactly the same memory wall problem.
This is where Turbo1Bit comes in: an open-source project that stacks TurboQuant-style KV cache quantization on top of Bonsai's 1-bit weights.
| Config (Bonsai 8B @ 65K context) | Model Weights | KV Cache | Total Memory | Reduction |
|---|---|---|---|---|
| Standard FP16 8B model | ~16 GB | ~8 GB | ~24 GB | 1× baseline |
| Bonsai 1-bit (no KV compression) | 1.15 GB | ~8 GB (FP16) | 10.6 GB | 2.3× |
| Bonsai 1-bit + Turbo1Bit (Q4_0 KV) | 1.15 GB | ~2 GB | 4.0 GB | 6× |
Turbo1Bit's KV cache compression findings for 1-bit models specifically:
| KV Quantization | Perplexity (WikiText-2) | vs Baseline | Memory Saving |
|---|---|---|---|
| FP16 (baseline) | 25.51 | — | 1× |
| Q8_0 (8-bit) | 25.49 | −0.1% | 1.75× |
| Q5_0 (5-bit) | 25.87 | +1.4% | 2.5× |
| Q4_0 (4-bit) | 26.82 | +5.1% | 2.91× |
Bonsai reveals a striking reality about the future of efficient models. When you compress model weights to 1-bit (1.15 GB for an 8B model), the KV cache becomes the overwhelming majority of memory usage. At 65K context with FP16 KV cache, the cache is 7× larger than the model itself. This flips the traditional assumption that model weights dominate memory. For 1-bit models, KV cache optimization isn't optional — it's essential. This is why combining Bonsai with TurboQuant-style KV compression yields such dramatic results: total memory drops from 10.6 GB to 4 GB.
Bonsai 8B is a useful sanity check for experts because it cleanly separates weight compression from runtime memory. The weights are tiny at about 1.1 GB, but the local benchmark logs in PrismML/Bonsai-demo still show 4,608 MiB of FP16 KV cache at 32K context. In practice, that means the bottleneck has not disappeared — it has simply moved from the model file to the cache.
The more interesting deployment story is Bonsai 8B plus TurboQuant-style Q4_0 KV compression. In the same local logs, the KV cache falls from 4,608 MiB to 1,296 MiB at 32K, and total resident memory drops from 6,011 MiB to 2,699 MiB. At 60K context, the compressed run still lands at about 3,782 MiB total. That is the point where 1-bit weights stop being a neat packaging trick and start looking like a serious long-context systems design.
PrismML/Bonsai-demo/memory_benchmark_results.json and PrismML/Bonsai-demo/turboquant_kv_results.json record Bonsai-8B at 32K with 4,608 MiB FP16 KV vs 1,296 MiB Q4 KV, and total resident memory of 6,011 MiB vs 2,699 MiB. The companion Bonsai materials identify the 8B GGUF weights at roughly 1,099 MiB. Public references: Bonsai-demo, Turbo1Bit GitHub.

| Method | Year | Bits (K/V) | Memory Reduction | Needs Calibration? | Quality Impact |
|---|---|---|---|---|---|
| FP16 (Baseline) | — | 16 / 16 | 1× | N/A | Baseline |
| INT8 Quantization | Various | 8 / 8 | 2× | Minimal | Negligible |
| KIVI | 2024 | 2 / 2 | ~8× | No (tuning-free) | Near-zero loss |
| KVQuant | 2024 | 3 / 3 | ~5× | Minimal | <0.1 perplexity |
| TurboQuant | 2025 | 2.5–3.5 | 4–7× | No (data-oblivious) | Near-zero loss |
FlashAttention is not a KV cache compression technique. But it changed how attention (and the KV cache) is read from GPU memory, making everything faster without touching the model architecture itself.
Tri Dao et al. observed that standard attention implementations waste enormous time moving data between GPU memory (HBM) and the fast on-chip memory (SRAM). FlashAttention uses a technique called "tiling": instead of computing the entire attention matrix at once (which requires materializing an enormous n×n matrix), it processes the attention in small blocks that fit in the fast on-chip SRAM.
Key result: 2–4× speedup over optimized baselines, with linear memory usage instead of quadratic. This means the n×n attention matrix is never fully materialized in GPU memory.
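The core trick that makes tiling possible is the online softmax: process Keys and Values in tiles while carrying running max and normalizer statistics, so the full score row never exists. A simplified single-query NumPy version (a sketch of the math, not the fused CUDA kernel):

```python
import numpy as np

def attention_tiled(q, K, V, block=2):
    """Online-softmax attention over K/V tiles; no full score row stored."""
    d = q.shape[0]
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)              # rescale the running stats
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(8, 4)), rng.normal(size=(8, 4))

s = K @ q / np.sqrt(4)                        # reference: materialize scores
w = np.exp(s - s.max())
w /= w.sum()
print(np.allclose(attention_tiled(q, K, V), w @ V))   # True
```

The tiled result matches the materialized softmax exactly; on a GPU, each tile of K and V lives in SRAM while the running statistics stay in registers.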
With FlashAttention-2, Dao improved upon the original with better parallelism and work partitioning across GPU thread blocks.
FlashAttention is now integrated into most major LLM frameworks (vLLM, Hugging Face Transformers, PyTorch) and is widely used for reducing memory overhead and improving throughput during both training and inference.
Let's bring everything together. The table below covers every major open model family from the original baseline through to mid-2026, showing exactly how each one handles the KV cache challenge.
| Model | Year | Params | Attention | Q Heads | KV Heads | Context | KV/Token (FP16) |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2023 | 7B | MHA | 32 | 32 | 4K | 512 KB |
| LLaMA-2 70B | 2023 | 70B | GQA | 64 | 8 | 4K | 640 KB |
| Mistral 7B | 2023 | 7B | GQA+SWA | 32 | 8 | 32K (4K window) | 128 KB |
| Gemma 1 7B | 2024 | 7B | MHA | 16 | 16 | 8K | 448 KB |
| LLaMA-3 8B | 2024 | 8B | GQA | 32 | 8 | 128K | 128 KB |
| LLaMA-3 405B | 2024 | 405B | GQA | 128 | 8 | 128K | 504 KB |
| Qwen-2 72B | 2024 | 72B | GQA | 64 | 8 | 128K | 320 KB |
| Gemma 2 27B | 2024 | 27B | GQA+LG | 32 | 16 | 8K (4K local) | Reduced (~50%) |
| Phi-4 14B | 2024 | 14B | GQA | 24 | 8 | 16K | 128 KB |
| DeepSeek-V2 | 2024 | 236B (21B active) | MLA | 128 | Latent (dc=512) | 128K | ≈ GQA with 2.25 groups |
| DeepSeek-V3 | 2024 | 671B (37B active) | MLA+MoE | 128 | Latent (dc=512) | 128K | 70 KB |
| Model | Year | Params | Attention | Q Heads | KV Heads | Context | Key Innovation |
|---|---|---|---|---|---|---|---|
| Gemma 3 27B | 2025 | 27B | GQA+LG | 32 | Varies | 128K (1K local) | 5:1 local-global ratio; only 1/6 layers full-attention |
| Qwen3 0.6B | 2025 | 0.6B | GQA | 16 | 8 | 32K | QK-Norm; no QKV-bias; tied embeddings |
| Qwen3 4B | 2025 | 4B | GQA | 32 | 8 | 32K | 36 layers; 4:1 Q-to-KV ratio |
| Qwen3 8B | 2025 | 8.2B | GQA | 32 | 8 | 128K | Thinking mode (hybrid reasoning) |
| Qwen3 32B | 2025 | 32.8B | GQA | 64 | 8 | 128K | 64 layers; 8:1 Q-to-KV ratio |
| Qwen3 235B-A22B | 2025 | 235B (22B active) | GQA+MoE | 64 | 4 | 128K | 128 experts × 8 active; 16:1 Q-to-KV |
| Qwen3.5 4B | 2025 | 4B | GQA | — | — | 128K | Latest Qwen iteration; improved reasoning |
| LLaMA 4 Scout | 2025 | 109B (17B active) | LongCtx+MoE | — | — | 10M (!) | Official model card lists a 10M-token context window; detailed KV internals are not broken out in this survey. |
| LLaMA 4 Maverick | 2025 | 400B (17B active) | LongCtx+MoE | — | — | 1M | Long-context multimodal MoE model; detailed KV internals not broken out here. |
| Kimi K2 | 2025 | 1.04T (32B active) | MLA+MoE | 64 | Latent (MLA) | 128K | 384 experts (8 active); MuonClip optimizer; 15.5T tokens |
| Gemma 4 E2B | 2026 | ~5.1B (2.3B eff.) | GQA+LG | 8 | 1 | 128K+ | PLE; KV sharing across 20/35 layers (num_kv_shared_layers=20) |
| Gemma 4 E4B | 2026 | ~8B (4B eff.) | GQA+LG | 8 | 2 | 128K+ | PLE; 42 layers; head_dim 256/512; sliding_window=512; num_kv_shared_layers=18 |
| Gemma 4 26B-A4B | 2026 | 26B (3.8B active) | GQA+LG+MoE | — | 8 local / 2 global | 128K+ | 128 experts (8 active) + 3× shared expert |
| Bonsai 8B | 2025 | 8.2B (1-bit weights!) | GQA+1-bit | — | — | 128K | 1.15 GB model; KV cache is FP16 (weights 1-bit, cache is not). +Turbo1Bit for Q4 KV. |
| Gemma 4 31B | 2026 | 31B (Dense) | GQA+LG | — | 16 (all layers) | 128K+ | 5:1 local-global; p-RoPE; unified K/V in global layers |
Below are KV cache memory calculations derived from published config.json files where available. Rows marked with * are estimates due to hybrid architectures (mixed head dimensions, KV sharing, or sliding windows) where an exact single number is not straightforward.
| Model | Layers | KV Heads | Head Dim | KV/Token | @ 1K Tokens | @ 128K Tokens | Calculation Notes |
|---|---|---|---|---|---|---|---|
| LLaMA-1 65B (MHA) | 80 | 64 | 128 | 2,560 KB | 2.5 GB | 320 GB | 2×80×64×128×2 = 2,621,440 B. Pure MHA: every head has its own K+V. Source: HF config. |
| Falcon-7B (MQA) | 32 | 1 | 64 | 8 KB | 8 MB | 1.0 GB | 2×32×1×64×2 = 8,192 B. All 71 query heads share 1 K + 1 V (multi_query=true). Source: HF config. |
| Mistral 7B (GQA+SWA) | 32 | 8 | 128 | 128 KB | 128 MB | 512 MB* | 2×32×8×128×2 = 131,072 B. *Sliding window caps the cache at 4K tokens regardless of context. |
| Qwen3 235B (GQA-4+MoE) | 94 | 4 | 128 | 188 KB | 188 MB | 24 GB | 2×94×4×128×2 = 192,512 B. Only 4 KV heads for 64 query heads (16:1 ratio). |
| DeepSeek-V3 (MLA+MoE) | 61 | Latent (dc=512) | 576* | 69 KB | 69 MB | 8.6 GB | 61×(512+64)×2 = 70,272 B. Stores compressed latent, not raw K+V. *576 = dc+dR. |
| Kimi K2 (MLA+MoE) | 61 | Latent (dc=512) | 576* | 69 KB | 69 MB | 8.6 GB | Same MLA config as DeepSeek-V3: kv_lora_rank=512, qk_rope_head_dim=64. 384 experts (vs 256 in DSV3). |
| Gemma 4 31B (GQA+LG) | 60 (50 local + 10 global) | 16 (all layers) | 256 / 512 | ~1,120 KB naive | ~1.1 GB | ~41 GB† | †50 sliding layers (16 KV, d=256, capped at 1,024 tok) + 10 global layers (16 KV, d=512, full ctx). At 128K, global layers dominate: ~41 GB total. At 1K: ~1.1 GB. Source: official config.json. |
| Qwen3.5 9B (GQA+DeltaNet) | 32 (8 attn) | 4 | 256 | 32 KB | 32 MB | 4 GB | 2×8×4×256×2 = 32,768 B. Only 8/32 layers use attention; 24 use DeltaNet (no KV cache). |
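The calculation pattern in the notes column is the same everywhere: per-token KV bytes = 2 (K and V) × layers × kv_heads × head_dim × bytes_per_value, with MLA replacing the per-head K/V term by its latent width. A small helper reproduces the simpler rows (config values taken from the table above; hybrid local-global models need the per-layer breakdown in their notes):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_val: int = 2) -> int:
    """Standard MHA/GQA/MQA: K and V stored per layer per KV head (FP16 = 2 B)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val

def mla_bytes_per_token(layers: int, latent_dim: int, rope_dim: int,
                        bytes_per_val: int = 2) -> int:
    """MLA: one compressed latent plus a decoupled RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_val

KB, GB = 1024, 1024**3
ctx = 128 * 1024  # 128K tokens

llama1 = kv_bytes_per_token(80, 64, 128)   # MHA: 2,560 KB/token
falcon = kv_bytes_per_token(32, 1, 64)     # MQA: 8 KB/token
dsv3 = mla_bytes_per_token(61, 512, 64)    # MLA: ~69 KB/token
print(f"LLaMA-1 65B: {llama1 / KB:.0f} KB/token, {llama1 * ctx / GB:.0f} GB @128K")
print(f"Falcon-7B:   {falcon / KB:.0f} KB/token, {falcon * ctx / GB:.1f} GB @128K")
print(f"DeepSeek-V3: {dsv3 / KB:.1f} KB/token, {dsv3 * ctx / GB:.1f} GB @128K")
```

Running it reproduces the table's 320 GB, 1.0 GB, and 8.6 GB figures at 128K context.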
Qwen3.5-9B achieves remarkably low KV cache by using a hybrid DeltaNet + Attention architecture. Out of 32 layers, only 8 use traditional attention (which requires KV cache). The remaining 24 layers use Gated DeltaNet — a form of linear attention that maintains a fixed-size recurrent state instead of a growing cache. This means 75% of the model's layers add zero KV cache overhead regardless of context length. Combined with just 4 KV heads and head_dim=256, the result is just 32 KB per token — an astounding 16× less than the LLaMA-2 7B baseline of similar parameter count. At 128K context, that's only 4 GB for the entire KV cache.
Gemma 4 E4B uses just 2 KV heads and shares KV cache across 18 of its 42 layers (num_kv_shared_layers=18). This means only 24 layers maintain unique KV caches. Sliding window layers are capped at 512 tokens regardless of context length. The result: a 4B-effective model that can handle 128K context with roughly 2 GB of KV cache, putting it in the mobile/edge-friendly range.
Gemma 4 E4B is compelling because the base architecture already puts 128K context in roughly the 2 GB KV-cache range. That means the model is not relying on a heroic post-training trick to become usable on constrained hardware; it gets there by design through a very low KV-head count, cross-layer sharing, and local-global attention.
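A back-of-envelope check of the ~2 GB figure. Important caveat: the local-vs-global layer split below is an assumption for illustration, not a published config value — I assume that of the 24 layers with unique caches (42 total minus 18 shared), 20 are sliding-window layers (head_dim 256, window 512) and 4 are global layers (head_dim 512, full context), with 2 KV heads throughout:

```python
# Hedged back-of-envelope for Gemma 4 E4B's ~2 GB @128K KV budget.
# ASSUMED split (not from an official config): 20 local + 4 global layers
# among the 24 layers that keep unique KV caches (42 total - 18 shared).
KV_HEADS, BYTES = 2, 2          # 2 KV heads, FP16
ctx = 128 * 1024

def cache_bytes(layers: int, head_dim: int, cached_tokens: int) -> int:
    """Total K+V bytes for a group of layers with a given cache length."""
    return 2 * layers * KV_HEADS * head_dim * BYTES * cached_tokens

local = cache_bytes(20, 256, 512)    # sliding-window layers cap at 512 tokens
global_ = cache_bytes(4, 512, ctx)   # global layers cache the full context
total_gb = (local + global_) / 1024**3
print(f"~{total_gb:.1f} GB at 128K context")
```

Under these assumptions the global layers contribute essentially all of the budget (~2.0 GB), and the 512-token sliding windows add only ~20 MB, which matches the "roughly 2 GB" claim.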
In our local turboquant-llamacpp measurements on Gemma 4 E4B running on an M2 Pro, TurboQuant-style 4-bit KV reduced cache memory from 104 MiB to 29 MiB at 4K, 296 MiB to 83 MiB at 16K, and 552 MiB to 155 MiB at 32K, staying near a 3.56× reduction throughout. If that same ratio holds at 128K, a ~2.1 GB FP16 KV budget lands around 600 MiB — which is exactly why Gemma-class models are so interesting for on-device long-context inference.
Sources: Gemma 4 E4B model card, turboquant-llamacpp README.

Despite scaling from 671B (DeepSeek-V3) to 1.04 trillion parameters (Kimi K2), the KV cache stays identical at ~70 KB per token. Both use MLA with kv_lora_rank=512 and qk_rope_head_dim=64 across 61 layers. The entire parameter increase went into more experts (384 vs 256) and expert capacity — not into attention overhead. This proves MLA's KV cost is decoupled from model scale.
1. The GQA Camp (LLaMA, Qwen, Gemma, Phi, Mistral): Nearly every major model family standardized on GQA with 8 KV heads. The Qwen3 235B MoE pushes this further to just 4 KV heads — a 16:1 compression ratio. Simple, proven, universally adopted.
2. The MLA Camp (DeepSeek, Kimi K2): These models compress KV into tiny latent vectors, achieving the lowest per-token cache cost. Kimi K2 took DeepSeek's MLA and scaled it to 1 trillion parameters with 384 experts — the largest open MLA model to date.
3. The Hybrid Attention Camp (LLaMA 4, Gemma 3/4): These models attack the problem architecturally by mixing short-range local attention with sparse global attention. Meta's public LLaMA 4 Scout model card advertises a 10 million token context window. Gemma 4 uses KV sharing across layers plus hybrid head dimensions (256 for local, 512 for global).
The real power comes from combining multiple techniques. A modern deployment might use:
GQA (4–8×) + Local-Global attention (3–5×) + TurboQuant (4–7×) + PagedAttention (eliminates waste) + SnapKV eviction (further reduction)
These compound: 8 × 4 × 5 = 160× theoretical reduction from the naive FP16 MHA baseline. No single production model uses all five simultaneously today, but the trend is clear: each new generation stacks more of these techniques together.
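The compounding arithmetic, using one representative factor from each range above (the exact factors vary by model and workload):

```python
# Representative mid-range factors for each technique in the stack
stack = {"GQA (8 KV heads)": 8, "local-global attention": 4, "4-bit KV quant": 5}
reduction = 1
for technique, factor in stack.items():
    reduction *= factor

baseline_kb_per_token = 512          # naive FP16 MHA, LLaMA-2 7B-class
ctx = 128 * 1024
naive_gb = baseline_kb_per_token * ctx / 1024**2
print(f"combined reduction: {reduction}x")
print(f"KV @128K: {naive_gb:.0f} GB naive -> {naive_gb / reduction:.1f} GB stacked")
```

At 128K context, a naive 64 GB cache drops to 0.4 GB under this idealized stack, which is why compounding matters more than any single technique.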
Everything we've covered so far treats the model as a single entity: one prompt in, one response out. But the newest frontier in AI — agentic workflows — changes the game entirely. An AI agent doesn't just answer a question; it plans, breaks problems into steps, calls external tools (web search, code execution, APIs), reads the results, reasons about them, and repeats this cycle potentially hundreds of times. This creates a completely different stress profile for KV cache.
Moonshot AI's Kimi K2.5 (February 2026) is the first major open model specifically designed to handle this agentic paradigm at scale. Its technical report lays out how reasoning, memory, and KV cache management work differently when a model operates as an agent rather than a chatbot.
In a standard chatbot conversation, the context grows slowly — one user message, one model response, repeat. The KV cache grows predictably and stays within the context window.
An agent is nothing like this. Consider an agent researching a topic: it searches the web, pulls each result into its context, reasons about what it found, issues follow-up searches, and repeats.
After 200 tool calls, the context might contain millions of tokens — most of which are tool outputs that the model only needed briefly. The KV cache balloons, latency spikes, and eventually the model hits its context limit and breaks down.
Instead of a single agent chewing through a problem step-by-step (which creates one enormous, ever-growing KV cache), K2.5 introduces Agent Swarm: a framework where a central orchestrator breaks a complex task into independent sub-problems, then spawns multiple sub-agents to work on them simultaneously.
Each sub-agent runs in its own isolated context window with its own KV cache. Instead of one monstrous 500K-token context, you get ten 50K-token contexts running in parallel. When a sub-agent finishes, only its final output (a few hundred tokens) gets sent back to the orchestrator — not its full reasoning chain or tool outputs. The sub-agent's KV cache is then freed entirely.
This is essentially PagedAttention philosophy applied at the agent level: allocate memory where needed, free it when done, and never let any single context grow unbounded.
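A toy sketch of the resulting memory pattern. This is not Moonshot's implementation — every class and name here is hypothetical — but it shows why isolated, freed sub-agent caches bound peak KV memory far below a single monolithic context (sub-agents run sequentially here for simplicity; K2.5 runs them in parallel across separate cache instances):

```python
from dataclasses import dataclass

KV_BYTES_PER_TOKEN = 70 * 1024   # MLA-class per-token cost, per the K2 figures

@dataclass
class SubAgentContext:
    """Hypothetical stand-in for a sub-agent's isolated context window."""
    tokens: int = 0

    def run(self, work_tokens: int) -> int:
        self.tokens += work_tokens   # reasoning chain + tool outputs accumulate
        return 300                   # only a short summary is returned

@dataclass
class Orchestrator:
    tokens: int = 2_000              # task plan lives in the orchestrator context
    peak_kv: int = 0

    def solve(self, subtasks: list) -> None:
        for work in subtasks:
            ctx = SubAgentContext()
            summary_tokens = ctx.run(work)
            # Peak memory: orchestrator plus the live sub-agent cache
            live = (self.tokens + ctx.tokens) * KV_BYTES_PER_TOKEN
            self.peak_kv = max(self.peak_kv, live)
            self.tokens += summary_tokens   # keep only the summary...
            del ctx                         # ...and free the sub-agent cache

orc = Orchestrator()
orc.solve([50_000] * 10)                    # ten 50K-token sub-tasks
single = (2_000 + 10 * 50_000) * KV_BYTES_PER_TOKEN
print(f"swarm peak KV:    {orc.peak_kv / 1024**3:.1f} GB")
print(f"single-agent KV:  {single / 1024**3:.1f} GB")
```

The single-agent run accumulates all 500K tokens in one cache (~33 GB at MLA rates), while the swarm's peak stays bounded by one sub-agent window plus the orchestrator's summaries (~4 GB).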
The result: 4.5× latency reduction over single-agent baselines on complex web search tasks, and a 17.8% accuracy improvement on BrowseComp (60.6% → 78.4%).
The orchestrator doesn't use hand-coded rules to decide when to spawn sub-agents. Instead, Moonshot trained it with a custom reinforcement learning method called Parallel-Agent RL (PARL), whose reward function combines three components.
A critical design choice: sub-agents are frozen (their weights don't update during PARL training). Only the orchestrator learns. This avoids a nasty problem in multi-agent RL where you can't tell which agent deserves credit for a good outcome.
Kimi K2.5 uses a training technique called Toggle that alternates between two modes: a deep "thinking" mode with a large reasoning budget, and a frugal mode that keeps reasoning chains short.
Why this matters for KV cache: reasoning tokens are expensive. A "thinking" model like K2.5 might generate 36,000 internal reasoning tokens before producing a 200-token answer. Those 36K tokens all live in the KV cache. Toggle teaches the model to be frugal — use 7K tokens for easy sub-tasks, save the 36K budget for genuinely hard ones. At ~70 KB per token (MLA), that's the difference between 490 MB and 2.5 GB of KV cache just for the thinking phase.
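The budget arithmetic in that last sentence, spelled out (70 KB/token is the MLA figure used throughout this article):

```python
KV_PER_TOKEN_KB = 70   # MLA per-token cache cost (Kimi K2 / DeepSeek-V3 class)

def thinking_cache_mb(reasoning_tokens: int) -> float:
    """KV cache consumed by the reasoning chain alone, in (decimal) MB."""
    return reasoning_tokens * KV_PER_TOKEN_KB / 1000

print(f"frugal (7K tokens):  {thinking_cache_mb(7_000):.0f} MB")        # 490 MB
print(f"deep  (36K tokens):  {thinking_cache_mb(36_000) / 1000:.2f} GB")  # 2.52 GB
```

A 5× difference in reasoning budget is a 5× difference in thinking-phase cache, which is what makes Toggle an effective memory lever, not just a latency one.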
For extremely long agent sessions, K2.5 uses a strategy called "Discard-all": after each major reasoning phase, the model discards the full context (including tool outputs) and starts fresh with only the accumulated conclusions. This effectively resets the KV cache periodically rather than letting it grow to the context limit.
On the BrowseComp benchmark, using Discard-all boosted accuracy from 60.6% to 74.9% — beating GPT-5.2's reported 65.8%, Claude Opus 4.5 (37.0%), and Gemini 3 Pro (37.8%). The surprising part: throwing away context helped, because a bloated context full of stale tool outputs actually confuses the model more than it helps.
| Aspect | Standard Chat Mode | Kimi K2.5 Agent Mode |
|---|---|---|
| Context growth | Linear, predictable | Bursty (tool outputs spike it), then reset |
| Max KV cache per session | Context window × 70 KB | Bounded by sub-agent window (~50K tokens) |
| Parallelism | One cache per user | Multiple caches (orchestrator + N sub-agents) |
| Cache lifetime | Entire conversation | Sub-agent caches freed after each sub-task |
| Reasoning overhead | Minimal (short CoT) | 7K–36K reasoning tokens per step (Toggle-managed) |
| Tool calls per session | 0–5 typically | Hundreds of tool calls across parallel sub-agents (paper reports up to 100 tool calls per sub-agent in BrowseComp) |
| Context resets | None | Periodic (Discard-all after each phase) |
Kimi K2.5 demonstrates that as AI shifts from "chatbots" to "agents," the KV cache problem transforms from a simple memory-scaling challenge into a dynamic memory management problem. The solutions aren't just about compressing each token's cache (MLA, GQA, quantization) — they're about deciding which contexts to keep alive, when to reset them, and how to distribute work across parallel cache instances. Agent Swarm is essentially the agentic equivalent of PagedAttention: instead of paging KV blocks within a single context, it pages entire contexts across multiple parallel agents.
As of early 2026, the KV cache problem that once threatened to halt the scaling of large language models has been substantially addressed through a multi-pronged approach. Here's where things stand:
1. Trillion-Parameter MLA: Moonshot AI's Kimi K2 pushed MLA to 1.04 trillion parameters with 384 experts — proving that DeepSeek's latent compression scales far beyond its original 236B deployment. With only 64 attention heads (vs DeepSeek-V3's 128), K2 squeezes even more efficiency out of MLA.
2. Ten-Million-Token Context: Meta's public LLaMA 4 Scout model card reports a 10 million token context window. That is well over 10,000 pages of text in a single prompt, even though the public card does not spell out every attention/KV implementation detail behind it.
3. On-Device Intelligence: Gemma 4 E2B and E4B target mobile and edge deployments with 128K+ context. Their small KV budgets come from low KV-head counts, local-global attention, KV sharing across layers, and per-layer embeddings. Separately, Gemma 4 26B-A4B is the MoE family member that uses sparse routing (3.8B active of 26B); that MoE detail should not be attributed to the E2B/E4B variants. Post-training KV cache quantization methods (such as TurboQuant or KIVI) could further reduce memory, though Google's Gemma 4 documentation does not cite any specific KV quantization scheme as part of the on-device strategy.
4. Aggressive KV Head Reduction: Qwen3's MoE variant uses just 4 KV heads with 64 query heads — a 16:1 ratio. This trend toward extreme Q-to-KV ratios suggests the field is converging on the insight that you need very few KV heads, even for models with many query heads.
5. Data-Oblivious Quantization: TurboQuant requires no training data or model-specific tuning, making it a universal plug-in for any deployment. Combined with architectural improvements, effective KV cache reductions of 50–100× over the 2017 baseline are now routine.
| Metric | Naive FP16 MHA Baseline (e.g. LLaMA-2 7B, 2023) | 2026 (State of the Art) | Improvement |
|---|---|---|---|
| KV Cache per token (7B model) | 512 KB (FP16 MHA) | ~10–20 KB (MLA + quant) | 25–50× |
| Max context on single GPU | ~2K tokens | 1M+ tokens | 500× |
| Serving throughput | Baseline | 29× (with eviction) | 29× |
| Memory waste in serving | 60–80% | ~0% (PagedAttention) | Eliminated |
| Smallest device for 8B+ models | Data center GPUs | Smartphones (Bonsai 8B: 44 tok/s on iPhone) | Democratized |
The story of KV cache optimization is a story of the AI community refusing to accept limitations. When the "memory wall" threatened to stop progress, researchers attacked the problem from every angle: architectural changes (MQA, GQA, MLA), memory management (PagedAttention), numerical compression (KIVI, TurboQuant), intelligent caching (H2O, SnapKV), and attention pattern design (sliding window, local-global). Together, these advances have reduced the effective KV cache cost by 50–100×, enabling language models to run on devices that fit in your pocket while handling documents longer than entire novels.