Attention, up close

The kernel side made attention fast. This is the design side: the handful of choices that separate a 2019 transformer from a 2025 one, and why each exists.

First, what attention actually does

A transformer turns every token (a word or word-piece) into a vector, an embedding. On its own each embedding knows only the token itself, not the sentence around it. Attention is the step where those vectors talk to each other, so each one can pull in context from the others. It is the whole reason a transformer understands "bank" differently in "river bank" and "money bank."

The trick is that every token produces three vectors, by multiplying its embedding by three learned matrices:

Query (Q): what this token is looking for. Its question.
Key (K): what this token offers, to be matched against other tokens' questions.
Value (V): the actual information this token hands over if it gets attended to.

To decide how much token i should listen to token j, take the dot product of i's Query with j's Key. A big dot product means the question and the offer line up, so the score is high. Do that for every pair and you get a grid of scores. Divide each by the square root of the head size (this keeps the numbers from blowing up), run each row through a softmax so its weights are positive and sum to 1, and you have the attention pattern: for each token, how much of its attention lands on every other.

One rule for text generation: the causal mask. A token may attend only to itself and the tokens before it, never the future, because at generation time the future does not exist yet. That is why the grid below is triangular.

The last step: each token's new representation is the weighted sum of every Value, using its row of attention weights. A token that got 44% of the attention contributes 44% of its Value to the result. That blended vector, now soaked in context, is what moves on to the next layer.

Pick a query token below. Watch its row of the pattern, and the blend that becomes its new meaning. Notice that "creature" looks back at its adjectives "fluffy" and "blue," while the verb "roamed" looks back at its subject "creature." Nobody told it to; the Query and Key vectors just line up that way.

Query from:

afluffybluecreatureroamed a fluffy blue creature roamed

Attention pattern: each row is one query token; each cell is how much it attends to that column. Rows sum to 100%. The upper triangle is masked, a token cannot see the future.

"creature" attends mostly to "fluffy" (44%) and "blue" (40%). Its new representation is that weighted blend of Values.

Output for "creature" =

a weighted blend of every visible token's Value

Show as data

Attention weights (scaled dot-product + causal softmax) for the query "creature"
Key token	Raw score	Weight
a	0.23	6%
fluffy	2.14	44%
blue	2.05	40%
creature	0.57	9%

That is the entire mechanism. Everything else on this page is a modification of it. Two facts to carry forward:

This runs many times in parallel, each with its own Q/K/V, called heads. That is the "multi-head" in MHA, and the number of Key/Value heads is exactly what the KV cache section is about.
Computing the score grid, the softmax, and the value blend without ever storing the full grid in memory is exactly what FlashAttention does on the kernel side. Same math, fused.

Three questions every attention block answers

The core mechanism above (every token weighs every earlier token) has not changed since 2017. What changed is three design choices stacked on top of it, each solving a real problem:

How does the model know the order of tokens? Position encoding: RoPE, or the surprising NoPE.
How does long context fit in memory? Shrinking the KV cache: MHA to MQA to GQA to MLA.
How do you escape the quadratic cost? Locality and sparsity: sliding windows and DSA.

Take them one at a time. They are independent, and real models pick one answer to each.

Position: RoPE, and the NoPE surprise

Attention is a bag of dot products; on its own it has no idea which token came first. The original transformer added a position vector to the input embeddings. Modern models use RoPE (rotary position embedding) instead, and it works differently.

RoPE rotates the query and key vectors by an angle proportional to the token's position, inside every attention layer. Because both Q and K are rotated, their dot product ends up depending only on the relative offset between the two tokens. So absolute position goes in, but the attention score sees relative distance, with no extra parameters and no hard length limit. It is now near-universal: Llama, Mistral, Gemma, Qwen, and DeepSeek all use it.

RoPE does not extrapolate past its training length for free. The fix is to stretch the rotation frequencies: NTK-aware scaling raises RoPE's base frequency to keep local detail while reaching further, and YaRN adds an attention-temperature correction on top. Turning up the base frequency is exactly how models extend context (Gemma 3 uses a 1M base on its long-range layers).

The KV cache: from MHA to MLA

When a model generates text, it caches every past token's Key and Value so it does not recompute them each step. That KV cache is the memory that dominates long-context inference, and it grows with the number of KV heads times head size times layers times sequence length. Note the crucial asymmetry: query heads are never cached, so the entire race to shrink the cache targets the K and V side only.

Slide the context length. MHA balloons into hundreds of GB; watch GQA, MQA, and MLA cut it down, and read the quality each one trades away (or does not).

Context length: 131K tokens

KV cache for a 32-head, 32-layer model at 131K tokens (BF16)

MHA shares nothing: every query head has its own K and V head. It caches 68.7 GB here, a 1.0x cut from MHA. best quality, biggest cache.

Show as data

KV cache per variant at 131K tokens (32 heads, 32 layers, BF16)
Variant	KV heads	Cache	vs MHA
MHA	32	68.7 GB	1.0x
GQA	8	17.2 GB	4.0x
MQA	1	2.1 GB	32.0x
MLA	latent	4.8 GB	14.2x

MHA (multi-head). Every query head has its own K and V head. Best quality, biggest cache.
MQA (multi-query). All query heads share one K/V head. The cache shrinks by roughly the head count, but sharing so hard can cost quality.
GQA (grouped-query). The middle ground: split the query heads into groups, and each group shares one K/V head. MHA is GQA with one group per head; MQA is GQA with a single group. Near-MHA quality at close to MQA memory, which is why it is the default in Llama 3, Mistral, Gemma, Qwen, and gpt-oss.
MLA (multi-head latent, DeepSeek). A different lever. Instead of dropping KV heads, it compresses every head's K and V into one small shared latent vector, caches only that, and rebuilds the per-head K and V on the fly. Because no head is thrown away, it reaches GQA-level cache at MHA-level quality.

MLA has one subtlety worth knowing. RoPE is position-dependent, so it cannot be folded into the fixed reconstruction matrices. DeepSeek's fix is to split off a small decoupled RoPE key, shared across heads, and cache it beside the latent. So MLA caches two small things per token: the compressed latent and one little RoPE key.

Long context: sliding windows and DSA

A small KV cache still leaves attention itself quadratic: every token attends to every earlier one. Two ideas break that on long context, from opposite directions.

Sliding window attention (SWA) makes each token attend only to the last W tokens, a diagonal band. It caps the local cost and cache at W. Stacking layers still compounds the reach (roughly W times the number of layers), so the model is not actually limited to W. Real models interleave local and global layers: Mistral 7B used W = 4096 on every layer; Gemma 2 alternated local and global 1:1; Gemma 3 went 5:1 local-to-global with W = 1024; and gpt-oss alternates dense and sliding layers with a tiny 128-token window, held stable by attention sinks (a learned bias that keeps the always-important first tokens).

DSA (DeepSeek Sparse Attention, 2025) uses learned sparsity instead of a fixed window. A lightweight lightning indexer scores every past token for relevance, and each query then attends only to the top-k it picks (k = 2048). That turns the cost from order N squared into order N times k. It is built on top of MLA: the selected tokens' latent K/V are what attention reads. Unlike a sliding window, the choice is content-based and per-query, so any earlier token can be selected, near or far.

How real stacks mix them

The three axes are independent, so a modern model picks one answer to each (and usually adds MoE in the feed-forward):

DeepSeek-V3. MLA (compressed KV) + decoupled RoPE + MoE. V3.2 adds DSA on top of MLA for sparse long context.
Gemma 3. GQA + interleaved sliding/global attention (5:1, window 1024) + RoPE with a 1M base on the global layers.
gpt-oss. GQA (8 KV heads) + alternating dense/sliding (window 128) + attention sinks + RoPE extended with YaRN + MoE.
Llama 3 / Mistral. The industry baseline: GQA + RoPE (Mistral adds a sliding window).

Cheat sheet

The three attention design axes and their options
Axis	Options	What it decides
Position	RoPE, NoPE, (NTK / YaRN to extend)	how the model reads token order
KV heads	MHA, MQA, GQA, MLA	how big the KV cache is
Sparsity	full, sliding window, DSA	the long-context cost

Thread: The smallest unit of work on a GPU. One thread runs one instance of the kernel on one lane.
Warp: A group of 32 threads that execute the same instruction together, in lockstep. The scheduling unit of the GPU.
SIMT: Single Instruction, Multiple Threads. All 32 threads of a warp run one shared instruction over their own data.
Lockstep: All threads in a warp advance together on the same instruction at the same time.
Coalescing: When a warp reads neighbouring addresses so the hardware serves them in as few memory transactions as possible.
Sector: The 32-byte unit the hardware fetches from global memory. A warp wants its data packed into as few sectors as possible.
HBM: High Bandwidth Memory, the large off-chip global memory. Biggest and slowest tier, hundreds of cycles away.
Shared memory: Fast on-chip scratchpad private to one thread block. Split into 32 banks.
Register: Per-thread on-chip storage. The fastest memory, about one cycle to access.
L2 cache: On-chip cache shared by all SMs, sitting between the per-SM L1 caches and global HBM.
SM: Streaming Multiprocessor. A core building block of the GPU that runs thread blocks. An A100 has 108 of them.
Sub-partition: One of the four processing blocks inside an SM. Each has its own warp scheduler and execution units.
Warp scheduler: The unit that picks one eligible warp each cycle and issues its next instruction. Four per SM.
Eligible: A warp that is ready to issue this cycle, not waiting on memory or a dependency.
Stalled: A warp that cannot issue yet because it is waiting, usually on a memory load.
Latency hiding: Keeping the machine busy during long waits by running other ready warps while one warp stalls.
Occupancy: The number of active warps on an SM divided by the maximum it can hold. More warps give the scheduler more to switch to.
Bank: One of the 32 slots shared memory is split into. Consecutive 4-byte words map to consecutive banks (word w lands in bank w mod 32), and each bank serves one word per cycle.
Bank conflict: When two or more threads in a warp want different words in the same bank. Their reads serialize.
Tiling: Loading a small block of a matrix into shared memory once so every thread in the block reuses it many times.
Data reuse: Using a value staged in fast memory many times before fetching new data, so slow global memory is touched as little as possible.
GEMM: General matrix multiply, C = A times B. The workhorse operation behind neural networks and the main thing GPUs are tuned for.
cp.async: An asynchronous copy from global memory straight into shared memory, without stalling the thread or passing through registers.
Software pipelining: Overlapping the load of the next tile with the compute on the current one, so memory latency hides behind useful work.
Double buffering: Using two shared-memory buffers that take turns: one is being computed on while the other is being filled by the next load.
Prefetch: Starting a load early, before the data is needed, so it has arrived by the time you use it.
Thread block: A group of threads that run on one SM and share its shared memory. Also called a CTA.
Register file: The pool of registers on an SM, shared out among all resident threads. About 256 KB on an A100.
Register pressure: How many registers a kernel needs per thread. High pressure means fewer warps fit, which lowers occupancy.
Register spilling: When a thread needs more registers than it has, the extra values spill to local memory, which actually lives in slow global memory.
Local memory: Per-thread memory that, despite the name, lives in slow off-chip global memory. Registers spill here when they run out.
TMA: Tensor Memory Accelerator. A Hopper copy engine that moves whole tensor tiles between global and shared memory from a single descriptor, so one thread issues the load.
Tensor map descriptor: A small host-built struct (128 bytes) that tells TMA the tensor base, shape, strides, tile size, element type, and swizzle. One thread passes it to issue a bulk copy.
mbarrier: An asynchronous barrier in shared memory. TMA signals it when a tile lands and waiting threads wake, handing each buffer stage from producer to consumer.
wgmma: Warpgroup matrix multiply. A Hopper instruction where 128 threads issue one asynchronous tensor-core matmul that reads its operands from shared memory.
Warpgroup: Four contiguous warps, 128 threads, the granularity Hopper wgmma operates on.
Warp specialization: Giving different warps different jobs: producer warps issue TMA loads while consumer warps run wgmma, overlapping load and compute.
Thread block cluster: A Hopper group of blocks co-scheduled on one GPC that can read each other’s shared memory (distributed shared memory).
Multicast: A TMA mode that broadcasts one global load into several blocks’ shared memory in a cluster, so a shared operand crosses the bus only once.
Tensor Memory: A dedicated on-SM memory on Blackwell (256 KB) that holds the MMA accumulator, so the register file no longer has to feed the tensor cores at FP4 rates.
tcgen05: Blackwell’s 5th-generation tensor-core MMA. A single thread issues the matmul for the whole block, reading operands from shared memory and TMEM.
Microscaling: Storing a block of low-precision values (say FP4) with one shared low-precision block scale (E8M0 for MXFP4, FP8 for NVFP4), so tiny formats stay accurate. NVFP4 uses a block of 16.
FP4: A 4-bit floating-point format (E2M1). Packs eight values per 32 bits and roughly quadruples tensor-core peak over FP16, at the cost of range.
Accumulator: The running sum D in D = A × B that a matmul builds up across its K steps. Where it lives (registers or TMEM) is a recurring bottleneck.
CuTe: The layout layer under CUTLASS. Expresses thread-to-data mappings as Shape ⊗ Stride and composes, tiles, and swizzles them at compile time.
CUTLASS: NVIDIA’s open template library for peak-performance GEMM and related kernels, built on CuTe layouts.
Layout: A CuTe object, Shape ⊗ Stride, that maps a logical coordinate to a linear memory offset. Change the stride to re-lay-out data without moving it.
Stride: How far apart, in memory, consecutive elements along an axis sit. A stride of 1 means contiguous, which is what makes a read coalesce.
Swizzle: A layout that permutes shared-memory addresses so a tile reads back bank-conflict-free and in the order the tensor cores want.
Roofline: A plot of attainable throughput against arithmetic intensity. An op is memory-bound under a bandwidth roof until enough reuse lifts it to the flat compute roof.
Arithmetic intensity: FLOPs performed per byte moved from memory. Low intensity is memory-bound; tiling raises it until the op becomes compute-bound.
Exponent: A float's exponent bits set how big or small it can get. More exponent means more dynamic range.
Mantissa: A float's mantissa bits set how finely it resolves values between powers of two. More mantissa means more precision.
BF16: Brain float 16: 8 exponent bits and 7 mantissa bits. Same range as FP32 with less precision, which is why it is the training default.
FP8: An 8-bit float in two flavors: E4M3 (more precision, forward pass) and E5M2 (more range, gradients). Introduced for tensor cores on Hopper.
Quantization: Storing weights or activations in fewer bits than they were trained in, usually with a shared scale factor to recover the real magnitudes.
INT8: An 8-bit integer format with evenly spaced steps and a shared scale. Cheap and tight when a tensor has no wild outliers.
Ternary (BitNet b1.58): Weights restricted to three values, -1, 0, and +1, about 1.58 bits each. The matmul becomes addition and subtraction with no multiplies; trained from scratch, not compressed after.
NVFP4: NVIDIA's 4-bit float (E2M1) with one FP8 scale per 16 values plus a per-tensor FP32 scale. Finer blocks than MXFP4 for better accuracy.
MXFP4: The open OCP microscaling 4-bit float: one power-of-two scale (E8M0) shared across every block of 32 values.
GGUF: The llama.cpp file format that packs a model plus metadata for local inference. Holds k-quant tensor types like Q4_K_M.
k-quant: A GGUF quantization scheme: weights in super-blocks of 256, split into sub-blocks of 32, with a two-level (super-block and sub-block) scale.
Attention: The operation that lets each token weigh every other token: score queries against keys, softmax the scores, then blend the values.
Softmax: Turns a row of numbers into positive weights that sum to 1, so they can act as attention weights or class probabilities.
Token: One chunk of a model’s input or output, roughly a word or word-piece. Sequences are measured in tokens.
Gradient: The correction signal used to update a model during training. Gradients span a huge range of magnitudes, which is why their number format needs range.
Tensor core: A dedicated unit inside the SM that multiplies a small matrix in one instruction, far faster than ordinary threads doing it multiply by multiply. First shipped on Volta.
MMA: Matrix multiply-accumulate: the tensor core’s core operation, D = A × B + C. wgmma (Hopper) and tcgen05.mma (Blackwell) are MMA instructions.
FLOP: One floating-point operation, a single multiply or add. Throughput is measured in FLOPs per second (FLOP/s).
Activation: The data flowing through a model (the inputs and intermediate results), as opposed to the fixed weights. Activations carry outliers, which makes them harder to quantize.
W4A16: A quantization recipe: 4-bit weights, 16-bit activations. Weight-only, so it saves memory but computes in 16-bit. GPTQ, AWQ, and GGUF are W4A16.
W4A4: A quantization recipe: 4-bit weights and 4-bit activations. The matmul runs on low-precision tensor cores, saving compute too, but activation outliers make it hard. NVFP4 is W4A4.
SDPA: PyTorch's scaled_dot_product_attention: a dispatcher that auto-picks a fused FlashAttention-style backend (flash, memory-efficient, or cuDNN) for your shapes and dtype.
FlexAttention: A PyTorch API for writing custom attention masks and score modifications as a small function that still compiles to one fused FlashAttention-style kernel.
SageAttention: Quantized attention: runs the attention matmuls in INT8 or FP4 on tensor cores, smoothing outlier channels first. The W4A4 idea applied to attention.
SonicMoE: A mixture-of-experts kernel library that packs routed tokens into contiguous groups so each expert’s grouped-GEMM tile is full and efficient.
Data parallel: Replicate the whole model on every GPU, split the batch, and average gradients with an all-reduce. DDP is the efficient PyTorch version.
FSDP / ZeRO: Sharded data parallel: split the batch AND shard params, gradients, and optimizer state across GPUs, gathering each layer just in time. Trades communication for memory.
Tensor parallel: Split each layer's weight matrices across GPUs (Megatron splits heads and MLP columns) and all-reduce the activations. Chatty, so it wants NVLink.
Pipeline parallel: Put different layers on different GPUs as stages; activations flow stage to stage. The idle time while the pipeline fills and drains is the "bubble."
Expert parallel: Spread a mixture-of-experts layer’s experts across GPUs and all-to-all the tokens to wherever their expert lives. DeepEP accelerates the shuffle.
Context / sequence parallel: Split the sequence across GPUs. Sequence parallel saves activation memory on the non-matmul parts; context parallel (ring attention) splits attention over ultra-long contexts.
All-reduce: A collective that sums a value across all GPUs and hands every GPU the total. Used to average gradients (data parallel) and combine activations (tensor parallel).
All-to-all: A collective where every GPU sends a different piece to every other GPU. Used to route tokens to their expert in expert parallelism.
KV cache: The stored Keys and Values of every past token, kept so they are not recomputed each step. It dominates long-context memory and grows with the number of KV heads, head size, layers, and sequence length.
RoPE: Rotary position embedding: encodes position by rotating the query and key vectors by an angle proportional to position, so the attention score depends on relative distance. Applied every layer, not added to embeddings.
NoPE: No positional encoding: a decoder-only causal model can infer position from the causal mask alone (a counting signal), with no explicit position input. Works only in the causal setting and has a finite usable range.
Sliding window attention: Each token attends only to the last W tokens (a local band), capping cost and local KV cache. Stacking layers still compounds the reach, so the model is not limited to W.
Attention sink: A learned bias that keeps the always-important first tokens in the softmax denominator, letting a very small sliding window (like gpt-oss’s 128) stay stable.
MHA: Multi-head attention: every query head has its own Key and Value head. Best quality, largest KV cache.
MQA: Multi-query attention: all query heads share a single Key/Value head. Smallest KV cache, but the hard sharing can cost quality.
GQA: Grouped-query attention: query heads are split into groups, each sharing one Key/Value head. Near-MHA quality at close to MQA memory, and the mainstream default.
MLA: Multi-head latent attention (DeepSeek): compress every head's K and V into one small shared latent, cache only that, and reconstruct per-head K/V on the fly. GQA-level cache at MHA quality, with a small decoupled RoPE key.
DSA: DeepSeek Sparse Attention (V3.2): a lightning indexer scores past tokens and each query attends only to its top-k (2048), turning attention from order N squared into order N times k. Built on MLA.
YaRN: A RoPE context-extension method: scale the rotation frequencies (NTK-style) and add an attention-temperature correction, so a model trained at one length works at a longer one.