Glossary
Every term Loom uses, in one plain sentence. The same definitions pop up as tooltips across the site; here they all sit in one place.
Execution model
Threads, warps, schedulers, and how work is placed on the chip.
- Eligible
- A warp that is ready to issue this cycle, not waiting on memory or a dependency.
- Latency hiding
- Keeping the machine busy during long waits by running other ready warps while one warp stalls.
- Lockstep
- All threads in a warp advance together on the same instruction at the same time.
- Occupancy
- The number of active warps on an SM divided by the maximum it can hold. More warps give the scheduler more to switch to.
- SIMT
- Single Instruction, Multiple Threads. All 32 threads of a warp run one shared instruction over their own data.
- SM
- Streaming Multiprocessor. A core building block of the GPU that runs thread blocks. An A100 has 108 of them.
- Stalled
- A warp that cannot issue yet because it is waiting, usually on a memory load.
- Sub-partition
- One of the four processing blocks inside an SM. Each has its own warp scheduler and execution units.
- tcgen05
- Blackwell’s 5th-generation tensor-core MMA. A single thread issues the matmul for the whole block, reading operands from shared memory and TMEM.
- Thread
- The smallest unit of work on a GPU. One thread runs one instance of the kernel on one lane.
- Thread block
- A group of threads that run on one SM and share its shared memory. Also called a CTA.
- Thread block cluster
- A Hopper group of blocks co-scheduled on one GPC that can read each other’s shared memory (distributed shared memory).
- Warp
- A group of 32 threads that execute the same instruction together, in lockstep. The scheduling unit of the GPU.
- Warp scheduler
- The unit that picks one eligible warp each cycle and issues its next instruction. Four per SM.
- Warp specialization
- Giving different warps different jobs: producer warps issue TMA loads while consumer warps run wgmma, overlapping load and compute.
- Warpgroup
- Four contiguous warps, 128 threads, the granularity Hopper wgmma operates on.
- wgmma
- Warpgroup matrix multiply. A Hopper instruction where 128 threads issue one asynchronous tensor-core matmul. Operand B is read from shared memory; operand A comes from shared memory or registers.
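The Occupancy entry above can be made concrete with a little arithmetic. The sketch below is a simplified model, not a replacement for a real occupancy calculator: the `SMLimits` defaults are assumed A100-like values, the names `SMLimits` and `occupancy` are hypothetical, and hardware allocation granularity is ignored.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp, as in the Warp entry


@dataclass(frozen=True)
class SMLimits:
    """Per-SM limits. The defaults are assumed A100-like values."""
    max_warps: int = 64              # resident warps per SM
    max_blocks: int = 32             # resident thread blocks per SM
    registers: int = 65536           # 32-bit registers (a 256 KB register file)
    shared_bytes: int = 164 * 1024   # shared memory available to blocks


def occupancy(threads_per_block: int, regs_per_thread: int,
              smem_per_block: int, sm: SMLimits = SMLimits()) -> float:
    """Active warps divided by the maximum the SM can hold (simplified)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # How many blocks fit under each limit taken on its own.
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = sm.shared_bytes // smem_per_block if smem_per_block else sm.max_blocks
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


print(occupancy(256, 32, 0))          # 1.0   (8 blocks of 8 warps)
print(occupancy(256, 128, 0))         # 0.25  (registers allow only 2 blocks)
print(occupancy(256, 32, 48 * 1024))  # 0.375 (shared memory allows only 3 blocks)
```

Under these assumptions, quadrupling the registers per thread cuts the resident warps to a quarter, which leaves the warp scheduler fewer eligible warps to switch to.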
Memory & data movement
The memory hierarchy and every trick for moving bytes cheaply.
- Bank
- One of the 32 slots shared memory is split into. Consecutive 4-byte words map to consecutive banks (word w lands in bank w mod 32), and each bank serves one word per cycle.
- Bank conflict
- When two or more threads in a warp want different words in the same bank. Their reads serialize.
- Coalescing
- When a warp reads neighbouring addresses so the hardware serves them in as few memory transactions as possible.
- cp.async
- An asynchronous copy from global memory straight into shared memory, without stalling the thread or passing through registers.
- Double buffering
- Using two shared-memory buffers that take turns: one is being computed on while the other is being filled by the next load.
- HBM
- High Bandwidth Memory, the large off-chip global memory. Biggest and slowest tier, hundreds of cycles away.
- L2 cache
- On-chip cache shared by all SMs, sitting between the per-SM L1 caches and global HBM.
- Local memory
- Per-thread memory that, despite the name, lives in slow off-chip global memory. Registers spill here when they run out.
- mbarrier
- An asynchronous barrier in shared memory. TMA signals it when a tile lands and waiting threads wake, handing each buffer stage from producer to consumer.
- Multicast
- A TMA mode that broadcasts one global load into several blocks’ shared memory in a cluster, so a shared operand crosses the bus only once.
- Prefetch
- Starting a load early, before the data is needed, so it has arrived by the time you use it.
- Register
- Per-thread on-chip storage. The fastest memory, about one cycle to access.
- Register file
- The pool of registers on an SM, shared out among all resident threads. About 256 KB on an A100.
- Register pressure
- How many registers a kernel needs per thread. High pressure means fewer warps fit, which lowers occupancy.
- Register spilling
- When a thread needs more registers than it has, the extra values spill to local memory, which actually lives in slow global memory.
- Sector
- The 32-byte unit the hardware fetches from global memory. A warp wants its data packed into as few sectors as possible.
- Shared memory
- Fast on-chip memory inside each SM that the threads of a thread block share. It is split into 32 banks.
- Software pipelining
- Overlapping the load of the next tile with the compute on the current one, so memory latency hides behind useful work.
- Swizzle
- A layout that permutes shared-memory addresses so a tile reads back bank-conflict-free and in the order the tensor cores want.
- Tensor map descriptor
- A small host-built struct (128 bytes) that tells TMA the tensor base, shape, strides, tile size, element type, and swizzle. One thread passes it to issue a bulk copy.
- Tensor Memory
- A dedicated on-SM memory on Blackwell (256 KB) that holds the MMA accumulator, so the register file no longer has to feed the tensor cores at FP4 rates.
- TMA
- Tensor Memory Accelerator. A Hopper copy engine that moves whole tensor tiles between global and shared memory from a single descriptor, so one thread issues the load.
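The Bank, Bank conflict, Coalescing, and Sector entries above all reduce to counting. The sketch below models them directly from the definitions (word w lands in bank w mod 32, and global memory is fetched in 32-byte sectors). The function names are hypothetical, and the model ignores hardware details beyond what the entries state.

```python
from collections import defaultdict

NUM_BANKS = 32      # shared-memory banks
WORD_BYTES = 4      # one bank word
SECTOR_BYTES = 32   # global-memory fetch unit


def bank_conflict_degree(word_indices):
    """Largest number of distinct words any single bank must serve.

    1 means conflict-free; n means the warp's reads serialize n ways.
    Threads reading the same word count once.
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())


def sectors_touched(byte_addresses, access_bytes=WORD_BYTES):
    """Number of distinct 32-byte sectors a warp's accesses fall in."""
    sectors = set()
    for addr in byte_addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


lanes = range(32)
print(bank_conflict_degree([t for t in lanes]))       # 1  (consecutive words)
print(bank_conflict_degree([32 * t for t in lanes]))  # 32 (column of a 32-wide tile)
print(bank_conflict_degree([33 * t for t in lanes]))  # 1  (same column, rows padded to 33)
print(sectors_touched([4 * t for t in lanes]))        # 4  (coalesced 4-byte reads)
print(sectors_touched([128 * t for t in lanes]))      # 32 (one sector per thread)
```

Padding each row by one word is one simple way to break the stride-32 pattern; a swizzled layout reaches the same conflict-free result without spending the extra shared memory.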
Matmul, kernels & CuTe
The matrix multiply at the center of it all, and the software around it.
- Accumulator
- The running sum D in D = A × B that a matmul builds up across its K steps. Where it lives (registers or TMEM) is a recurring bottleneck.
- Arithmetic intensity
- FLOPs performed per byte moved from memory. Low intensity is memory-bound; tiling raises it until the op becomes compute-bound.
- Attention
- The operation that lets each token weigh every other token: score queries against keys, softmax the scores, then blend the values.
- CuTe
- The layout layer under CUTLASS. Expresses thread-to-data mappings as Shape ⊗ Stride and composes, tiles, and swizzles them at compile time.
- CUTLASS
- NVIDIA’s open template library for peak-performance GEMM and related kernels, built on CuTe layouts.
- Data reuse
- Using a value staged in fast memory many times before fetching new data, so slow global memory is touched as little as possible.
- GEMM
- General matrix multiply, C = A × B (in full, C = αAB + βC). The workhorse operation behind neural networks and the main thing GPUs are tuned for.
- Gradient
- The correction signal used to update a model during training. Gradients span a huge range of magnitudes, which is why their number format needs range.
- Layout
- A CuTe object, Shape ⊗ Stride, that maps a logical coordinate to a linear memory offset. Change the stride to re-lay-out data without moving it.
- Roofline
- A plot of attainable throughput against arithmetic intensity. An op is memory-bound under a bandwidth roof until enough reuse lifts it to the flat compute roof.
- Softmax
- Turns a row of numbers into positive weights that sum to 1, so they can act as attention weights or class probabilities.
- Stride
- How far apart, in memory, consecutive elements along an axis sit. A stride of 1 means contiguous, which is what makes a read coalesce.
- Tiling
- Loading a small block of a matrix into shared memory once so every thread in the block reuses it many times.
- Token
- One chunk of a model’s input or output, roughly a word or word-piece. Sequences are measured in tokens.
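The Arithmetic intensity, Tiling, and Roofline entries above fit together in a few lines of arithmetic. The sketch below is illustrative only: the peak and bandwidth figures are assumed A100-like numbers, the function names are hypothetical, and the tile model counts only the A and B loads from global memory (it ignores writing C and any cache hits).

```python
PEAK_FLOPS = 312e12   # assumed dense FP16 tensor-core peak, FLOP/s
BANDWIDTH = 1.555e12  # assumed HBM bandwidth, bytes/s


def attainable_flops(intensity, peak=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Roofline: the lower of the bandwidth roof and the compute roof."""
    return min(peak, bandwidth * intensity)


def tiled_gemm_intensity(tile, bytes_per_element):
    """FLOPs per byte of global traffic for a square tile x tile output tile.

    Per K step the block loads `tile` elements of A and `tile` elements
    of B, and performs tile * tile multiply-adds (2 FLOPs each).
    """
    flops = 2 * tile * tile
    bytes_moved = 2 * tile * bytes_per_element
    return flops / bytes_moved


for tile in (1, 128, 512):
    ai = tiled_gemm_intensity(tile, bytes_per_element=2)  # FP16 operands
    tflops = attainable_flops(ai) / 1e12
    print(f"tile={tile:4d}  intensity={ai:6.1f} FLOP/B  attainable={tflops:6.1f} TFLOP/s")
```

With these assumed numbers the two roofs meet at roughly 200 FLOP/byte, so in this model only the largest tile reaches the flat compute roof.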
Number formats
How numbers are packed into bits and scaled.
- BF16
- Brain float 16: 8 exponent bits and 7 mantissa bits. Same range as FP32 with less precision, which is why it is the training default.
- Exponent
- A float's exponent bits set how big or small it can get. More exponent means more dynamic range.
- FP4
- A 4-bit floating-point format (E2M1). Packs eight values per 32 bits and roughly quadruples tensor-core peak over FP16, at the cost of range.
- FP8
- An 8-bit float in two flavors: E4M3 (more precision, forward pass) and E5M2 (more range, gradients). Introduced for tensor cores on Hopper.
- GGUF
- The llama.cpp file format that packs a model plus metadata for local inference. Holds k-quant tensor types like Q4_K_M.
- INT8
- An 8-bit integer format with evenly spaced steps and a shared scale. Cheap and tight when a tensor has no wild outliers.
- k-quant
- A GGUF quantization scheme: weights in super-blocks of 256, split into sub-blocks of 16 or 32 depending on the type (32 for Q4_K), with a two-level (super-block and sub-block) scale.
- Mantissa
- A float's mantissa bits set how finely it resolves values between powers of two. More mantissa means more precision.
- Microscaling
- Storing a block of low-precision values (say FP4) with one shared low-precision block scale (E8M0 for MXFP4, FP8 for NVFP4), so tiny formats stay accurate. NVFP4 uses a block of 16.
- MXFP4
- The open OCP microscaling 4-bit float: one power-of-two scale (E8M0) shared across every block of 32 values.
- NVFP4
- NVIDIA's 4-bit float (E2M1) with one FP8 scale per 16 values plus a per-tensor FP32 scale. Finer blocks than MXFP4 for better accuracy.
- Quantization
- Storing weights or activations in fewer bits than they were trained in, usually with a shared scale factor to recover the real magnitudes.
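The FP4, Exponent, Mantissa, and MXFP4 entries above can be tied together in one small sketch: decode the E2M1 bit pattern, then quantize a block with one shared power-of-two scale. This is a simplified illustration, not a reference implementation. The function names are hypothetical, rounding ties go to the smaller magnitude rather than to even, and NVFP4's FP8 block scale and per-tensor FP32 scale are not modelled.

```python
import math

E2M1_EMAX = 2  # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def e2m1_value(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)


GRID = [e2m1_value(c) for c in range(8)]  # the eight non-negative magnitudes


def quantize_block(block):
    """Return (shared_exp, codes) for one block under a power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        shared_exp = 0
    else:
        floor_log2 = math.frexp(amax)[1] - 1  # exact floor(log2(amax))
        shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))  # E8M0 range
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, GRID[-1])  # saturate at 6.0
        idx = min(range(8), key=lambda i: abs(GRID[i] - mag))
        codes.append(idx | (0b1000 if x < 0 else 0))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    scale = 2.0 ** shared_exp
    return [e2m1_value(c) * scale for c in codes]


print(GRID)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
block = [12.0, 1.0, -6.5] + [0.0] * 29  # one block of 32 values
exp, codes = quantize_block(block)
print(exp, dequantize_block(exp, codes)[:3])  # 1 [12.0, 1.0, -6.0]
```

The grid shows the trade in the FP4 entry: two exponent bits give a range up to 6, and one mantissa bit gives a single step between powers of two. The shared scale then moves that small grid to wherever the block's values actually sit.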
Other
- Activation
- The data flowing through a model (the inputs and intermediate results), as opposed to the fixed weights. Activations carry outliers, which makes them harder to quantize.
- All-reduce
- A collective that sums a value across all GPUs and hands every GPU the total. Used to average gradients (data parallel) and combine activations (tensor parallel).
- All-to-all
- A collective where every GPU sends a different piece to every other GPU. Used to route tokens to their expert in expert parallelism.
- Attention sink
- A learned bias that keeps the always-important first tokens in the softmax denominator, letting a very small sliding window (like gpt-oss’s 128) stay stable.
- Context / sequence parallel
- Split the sequence across GPUs. Sequence parallel saves activation memory on the non-matmul parts; context parallel (ring attention) splits attention over ultra-long contexts.
- Data parallel
- Replicate the whole model on every GPU, split the batch, and average gradients with an all-reduce. DDP is the efficient PyTorch version.
- DSA
- DeepSeek Sparse Attention (V3.2): a lightning indexer scores past tokens and each query attends only to its top-k (2048), turning attention from order N squared into order N times k. Built on MLA.
- Expert parallel
- Spread a mixture-of-experts layer’s experts across GPUs and all-to-all the tokens to wherever their expert lives. DeepEP accelerates the shuffle.
- FlexAttention
- A PyTorch API for writing custom attention masks and score modifications as a small function that still compiles to one fused FlashAttention-style kernel.
- FLOP
- One floating-point operation, a single multiply or add. Throughput is measured in FLOPs per second (FLOP/s).
- FSDP / ZeRO
- Sharded data parallel: split the batch AND shard params, gradients, and optimizer state across GPUs, gathering each layer just in time. Trades communication for memory.
- GQA
- Grouped-query attention: query heads are split into groups, each sharing one Key/Value head. Near-MHA quality at close to MQA memory, and the mainstream default.
- KV cache
- The stored Keys and Values of every past token, kept so they are not recomputed each step. It dominates long-context memory and grows with the number of KV heads, head size, layers, and sequence length.
- MHA
- Multi-head attention: every query head has its own Key and Value head. Best quality, largest KV cache.
- MLA
- Multi-head latent attention (DeepSeek): compress every head's K and V into one small shared latent, cache only that, and reconstruct per-head K/V on the fly. GQA-level cache at MHA quality, with a small decoupled RoPE key.
- MMA
- Matrix multiply-accumulate: the tensor core’s basic operation, D = A × B + C. wgmma (Hopper) and tcgen05.mma (Blackwell) are MMA instructions.
- MQA
- Multi-query attention: all query heads share a single Key/Value head. Smallest KV cache, but the hard sharing can cost quality.
- NoPE
- No positional encoding: a decoder-only causal model can infer position from the causal mask alone (a counting signal), with no explicit position input. Works only in the causal setting and has a finite usable range.
- Pipeline parallel
- Put different layers on different GPUs as stages; activations flow stage to stage. The idle time while the pipeline fills and drains is the "bubble."
- RoPE
- Rotary position embedding: encodes position by rotating the query and key vectors by an angle proportional to position, so the attention score depends on relative distance. Applied every layer, not added to embeddings.
- SageAttention
- Quantized attention: runs the attention matmuls in INT8 or FP4 on tensor cores, smoothing outlier channels first. The W4A4 idea applied to attention.
- SDPA
- PyTorch's scaled_dot_product_attention: a dispatcher that auto-picks a fused FlashAttention-style backend (flash, memory-efficient, or cuDNN) for your shapes and dtype.
- Sliding window attention
- Each token attends only to the last W tokens (a local band), capping cost and local KV cache. Stacking layers still compounds the reach, so the model is not limited to W.
- SonicMoE
- A mixture-of-experts kernel library that packs routed tokens into contiguous groups so each expert’s grouped-GEMM tile is full and efficient.
- Tensor core
- A dedicated unit inside the SM that multiplies a small matrix in one instruction, far faster than ordinary threads doing it multiply by multiply. First shipped on Volta.
- Tensor parallel
- Split each layer's weight matrices across GPUs (Megatron splits heads and MLP columns) and all-reduce the activations. Chatty, so it wants NVLink.
- Ternary (BitNet b1.58)
- Weights restricted to three values, -1, 0, and +1, about 1.58 bits each. The matmul becomes addition and subtraction with no multiplies; trained from scratch, not compressed after.
- W4A16
- A quantization recipe: 4-bit weights, 16-bit activations. Weight-only, so it saves memory but computes in 16-bit. GPTQ, AWQ, and 4-bit GGUF quants are typically W4A16.
- W4A4
- A quantization recipe: 4-bit weights and 4-bit activations. The matmul runs on low-precision tensor cores, saving compute too, but activation outliers make it hard. NVFP4 is W4A4.
- YaRN
- A RoPE context-extension method: scale the rotation frequencies (NTK-style) and add an attention-temperature correction, so a model trained at one length works at a longer one.
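The KV cache entry above says the cache grows with the number of KV heads, head size, layers, and sequence length. A minimal sketch of that product follows, comparing MHA, GQA, and MQA. The model shape (32 layers, 32 query heads, head size 128, 8192 tokens, FP16) is hypothetical and chosen only to make the numbers round; `kv_cache_bytes` is an illustrative name.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch=1):
    """Bytes needed to cache Keys and Values for every past token."""
    # The leading 2 counts one Key tensor plus one Value tensor per layer.
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_element


# Hypothetical model with 32 query heads; only the KV head count changes.
shape = dict(layers=32, head_dim=128, seq_len=8192)
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(kv_heads=kv_heads, **shape) / 2**30
    print(f"{name}: {kv_heads:2d} KV heads -> {gib:.3f} GiB")
# MHA: 32 KV heads -> 4.000 GiB
# GQA:  8 KV heads -> 1.000 GiB
# MQA:  1 KV heads -> 0.125 GiB
```

Because every factor enters linearly, cutting KV heads from 32 to 8 shrinks the cache fourfold in this example, and doubling the sequence length doubles it again.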