Loom

Follow the data through the GPU. A warp is 32 threads run in lockstep. The word comes straight from weaving, where the warp is the set of threads held in parallel on a loom.

Loom follows a single piece of data on its trip through an NVIDIA GPU, from the slow memory far away to the math units that crunch it, and shows you why the fast code is fast.

No CUDA required. Every idea here is a picture you can play with. Drag a slider, press play, predict what happens, then reveal it. Small quizzes and mnemonics help it stick.

What's inside

  1. Phase 0 Coalescing. Watch a warp load from global memory. See why reading neighbouring addresses is fast and scattering is slow.
  2. Phase 1 The SM. Meet the warp scheduler that hides memory latency, then beat a shared-memory bank conflict.
  3. Phase 2 Tiled matmul. Put it together: tile for reuse, pipeline with cp.async, and pad away the last conflict.
  4. Phase 3 Occupancy. How many warps actually fit, and why registers and shared memory pull against each other.
  5. Phase 4-5 Hopper & Blackwell. Re-choreograph the same matmul: one thread fires a TMA load, warps specialize, and the numbers shrink to FP4.
  6. Phase 6 CuTe layouts. The capstone: the Shape × Stride algebra that every kernel is really written in.
  7. Advanced Real kernels. The roofline, and how FlashAttention and MoE put it all to work.

Two side trips branch off the main path once you have the basics: the number zoo (BF16 to FP4, NVFP4, and the GGUF quants on your laptop) and real kernels (FlashAttention, DeepGEMM, MoE up close).

By the end you will know the vocabulary the rest of the field assumes, warp, coalescing, occupancy, tiling, and the one idea underneath all of it: move the bytes you need, and no more.

None of the unfamiliar words above need to mean anything yet. Each is unpacked from scratch when you reach it. Start at the top and scroll. Each scene builds on the one before it.

Memory coalescing, one warp at a time

Every GPU performance story starts here.

A warp is a group of 32 threads that run together in lockstep. Each of the 32 loads a word (4 bytes) from global memory. The hardware does not fetch 32 little pieces. It fetches aligned 32-byte sectors and hands each thread its slice.

Read neighbouring words and a whole sector is useful. That is a coalesced load. Stride apart and each sector arrives carrying mostly data nobody asked for.

Start small. Set the warp to 1 thread, press play, then step up to 4 and then 32. Then flip to strided and predict the damage before you reveal it.

Memory coalescing: one warp loading from global memorycoalesced access, 32 of 32 lanes active: 4 sectors fetched vs 4 ideal, 100% bus utilization.Warp: 32 lanes in lockstepGlobal memory: 32-byte sectorsShared memory: one slot per lane
coalesced access, 32 of 32 lanes active: 4 sectors fetched vs 4 ideal, 100% bus utilization.
Lanes (kiddy pool → full warp)
Access pattern
Timeline
Show this scene as data (screen-reader & LLM friendly)

Pattern: coalesced. Active lanes: 32. Sectors fetched: 4 (ideal 4). Useful bytes 128 of 128 fetched, giving 100% bus utilization.

Per-lane memory access for the current pattern
LaneWordByte addressSector
000–30
114–70
228–110
3312–150
4416–190
5520–230
6624–270
7728–310
8832–351
9936–391
101040–431
111144–471
121248–511
131352–551
141456–591
151560–631
161664–672
171768–712
181872–752
191976–792
202080–832
212184–872
222288–912
232392–952
242496–993
2525100–1033
2626104–1073
2727108–1113
2828112–1153
2929116–1193
3030120–1233
3131124–1273

Why the strided case hurts

A coalesced 32-lane load needs exactly four sectors. 128 bytes in, 128 bytes used, 100% of the bus doing real work.

The strided load touches a fresh sector for every lane. That is 32 sectors, 1024 bytes moved to deliver the same 128 useful bytes.

Same data, eight times the traffic. Eight times the wait on the slowest memory in the machine.

That one idea, move the bytes you need and no more, is the seed of tiling, cp.async, TMA, and everything Loom builds next.

Quick checkA 32-lane strided load touches a different sector for every lane. How much of each 32-byte sector actually gets used?

So who does the waiting?

A global load takes hundreds of cycles.

If a thread just sat and waited, the machine would idle most of the time. It does not.

On an A100 or H100, an SM keeps up to 64 warps resident at once. Its four warp schedulers each issue one instruction per cycle from an eligible warp, one that is not currently waiting on memory.

While one warp is stalled on memory, the scheduler selects another. That is latency hiding, and it is why GPUs want lots of threads.

Switching costs nothing. Every warp's registers stay live in hardware.

Drag the warp count up from 1 and watch the idle time vanish. Amber cells are warps stalled on memory. Teal is the one the scheduler selected this cycle. With too few warps it hits bubbles, cycles where nobody is eligible.

One warp scheduler hiding memory latency2 resident warps on one scheduler: it selected an eligible warp on 31% of cycles. In this compressed toy timeline about 6 warps hide the memory latency; a real global load is hundreds of cycles, which is why an SM holds up to 16 warps per scheduler (64 per SM).Cycles →warp 0warp 1selected warp 0

One scheduler shown (an A100/H100 SM has 4, up to 16 warps each). Cycle counts are compressed for visibility. A real global load is ~400–800 cycles, which is exactly why an SM keeps up to 64 warps ready.

2 resident warps on one scheduler: it selected an eligible warp on 31% of cycles. In this compressed toy timeline about 6 warps hide the memory latency; a real global load is hundreds of cycles, which is why an SM holds up to 16 warps per scheduler (64 per SM).

Your turn: kill the bank conflict

Shared memory is fast, but it is split into 32 banks, and a warp can read only one word per bank per cycle. Consecutive 4-byte words are spread across the banks in order, so word w sits in bank w mod 32, wrapping back to bank 0 every 32 words. Two threads that want different words in the same bank are a conflict, and the reads serialize.

Hit 32 different banks and the whole warp reads at once. Collide, and the read serializes into one transaction per pile-up.

Below, 32 threads read smem[thread × stride]. Find the stride that keeps all 32 banks distinct.

Puzzle. 32 threads read shared memory as smem[thread × stride]. Pick the stride that sends all 32 threads to 32 different banks, so the whole warp reads in a single cycle.
Shared-memory bank conflicts for stride 2Stride 2: 2 threads collide on each used bank. That is a 2-way conflict, so this read serializes into 2 transactions.32 shared-memory banks →0481216202428
16 of 32 banks used · 2-way

Stride 2: 2 threads collide on each used bank. That is a 2-way conflict, so this read serializes into 2 transactions. Even strides share a factor with 32, so threads pile up.

Show as data
Thread → word → bank at stride 2
ThreadWordBank
000
122
244
366
488
51010
61212
71414
81616
91818
102020
112222
122424
132626
142828
153030
16320
17342
18364
19386
20408
214210
224412
234614
244816
255018
265220
275422
285624
295826
306028
316230

Follow the data: a tiled matmul

Now we put the pieces together on the job GPUs are built for: matrix multiply.

Each output number is a row times a column. Done naively, every thread streams a whole row and column from global memory, and neighbours re-read the exact same values.

That is the trap. The math is cheap, but the machine spends all its time waiting on memory. Matmul is memory-bound until you fix the reuse.

The fix is tiling. Load a small tile of A and a tile of B into shared memory once, then let every thread in the block (the group of warps that runs together on one SM and shares its fast memory) reuse them many times before moving on.

Click any output tile in C. It is a running sum. Sweep K and watch it pull one tile of A and one tile of B per step, add them in, and fill up.

Tiled matrix multiply: accumulating one output tileOutput tile row 1, column 1: step 1 of 4, staging A tile (row 1, col 0) and B tile (row 0, col 1). 25% accumulated.ABC = A × B

Output tile row 1, column 1: step 1 of 4, staging A tile (row 1, col 0) and B tile (row 0, col 1). 25% accumulated.

Each number staged in shared memory is reused 32×. That turns 65,536 global reads into 2,048.

Quick checkA square tile is 32 elements on a side. Roughly how many times is each element you copy into shared memory reused before the block moves on?

Don't wait for the load. Prefetch it.

Tiling cut the traffic. But each tile still has to arrive before you can use it.

The naive loop stalls on every load: fetch a tile, sit idle, multiply, repeat. The slow memory shows up as dead time between every step.

cp.async changes that. It copies a tile from global straight into shared memory asynchronously, without parking the thread and without going through registers. An ordinary load routes each word global → register → shared, tying up the thread and a register the whole time; cp.async skips straight to shared, so the thread can keep working.

So while you multiply the current tile, you prefetch the next one. Two buffers, taking turns. The load hides behind the compute.

Run both schedules on one clock. Same work, same tiles. Watch the pipelined run cross the finish line while the synchronous one is still stalling.

Synchronous copy versus cp.async double-buffered pipelineThe same 4 tiles, loaded and multiplied. Synchronous finishes in 24 cycles; the cp.async pipeline finishes in 15, about 1.6 times faster, because loads overlap compute.Synchronousloadcomputecp.async · double-bufferedloadcompute
Same work, two schedules. Watch the pipeline finish first.

Boss: the tile fights back

The tile is staged in shared memory. Now the threads read a column of it to do the multiply.

Remember the 32 banks from Phase 1? Stepping down a column jumps a full row, 32 elements, each time. With 32 banks, that stride sends every thread in a column to the same bank. That is a 32-way conflict, the worst case.

Fix it with the oldest trick in the book: a little padding.

Boss puzzle. Your staged tile is 32 columns wide. 32 threads read one column, smem[thread × width]. Add the smallest padding that gives all 32 threads their own bank.
Bank usage reading a column of a width-32 tileWidth 32: 32 threads pile onto each used bank, a 32-way conflict. The read serializes into 32 transactions.32 shared-memory banks →0481216202428
width 32 · 1 of 32 banks · 32-way

Width 32: 32 threads pile onto each used bank, a 32-way conflict. The read serializes into 32 transactions. An even width shares a factor with 32, so threads collapse onto the same banks.

Show as data
Thread → word → bank reading a column at width 32
ThreadWordBank
000
1320
2640
3960
41280
51600
61920
72240
82560
92880
103200
113520
123840
134160
144480
154800
165120
175440
185760
196080
206400
216720
227040
237360
247680
258000
268320
278640
288960
299280
309600
319920

That is the Ampere playbook: tile for reuse, pipeline with cp.async, and pad to keep shared memory conflict-free. Hopper and Blackwell rebuild this same dataflow with new hardware. That is where Loom goes next.

How many warps actually fit?

Phase 1 sold you on warps: the more you have, the better the scheduler hides latency.

So why not run the maximum every time? Because warps are not free.

Each SM has a fixed budget: one register file and one slab of shared memory. Every thread's registers come out of that file. Every block's shared memory comes out of that slab.

Ask for more registers per thread, and fewer warps fit. Ask for more shared memory per block, and fewer blocks fit. Occupancy is the fraction of the maximum warps you manage to keep resident.

Drag the sliders. Push registers per thread up and watch the warps vanish. The meter that turns amber is the resource that ran out first.

64 / 64 warps 100% occupancy

Register file 100%
Shared memory 78%

✓ 64 warps resident. Plenty for the scheduler to hide latency. Limited by registers.

64 of 64 warps resident, 100% occupancy, limited by registers. 8 blocks of 8 warps each.
Quick checkYou raise registers per thread from 32 to 64 so each thread recomputes less. Occupancy drops from 100% to 50%. Win or loss?

This is the hinge between the last two phases. Latency hiding (Phase 1) wants many warps. Tiling (Phase 2) spends shared memory to get reuse. Both pull on the same fixed budget, and occupancy is where you balance them.

Hopper: let one thread do the load

Phase 2's Ampere loop worked, but look at who was doing the loading. Every one of the block's threads computed its own addresses and fired a cp.async. That address arithmetic burns registers, the very budget Phase 3 showed is scarce.

Hopper adds a dedicated copy engine, the TMA (Tensor Memory Accelerator). Now a single thread hands it a descriptor, a small struct that says where the tensor lives, its shape and strides, the tile to grab, and how to swizzle it (permute the shared-memory addresses so the load lands conflict-free). The engine does all the address math.

When the tile lands, TMA flips an mbarrier, a hardware flag threads can wait on, to wake the waiting threads. The other 127 never touched an address. They are free to compute.

Flip between Ampere and Hopper. Watch the block go from every thread grinding on address math to one thread firing a descriptor and the rest set free.

Ampere: all 128 threads compute their own addresses and fire cp.async, spending roughly 32 registers each on address arithmetic.

Split the warps: producer and consumer

With TMA doing the copy, you can give whole warps different jobs. This is warp specialization. One group of warps, the producer, does nothing but fire TMA loads into a multi-stage shared-memory buffer. Another group, the consumer, does nothing but run wgmma, the warpgroup matrix multiply, draining that buffer. A warpgroup is four warps, 128 threads, that together issue one instruction to the tensor cores: the dedicated units inside the SM that multiply a small matrix in a single step, far faster than threads doing it one multiply at a time. Every phase from here runs on them.

An mbarrier per stage hands each tile from producer to consumer. Because the buffer is a few stages deep, the producer can run several tiles ahead. Once it is primed, the tensor cores never wait.

Run the pipeline. The amber lane is TMA loading tiles; the teal lane is wgmma multiplying them. After the buffer fills, the teal lane runs solid to the end. That is a saturated tensor core.

Warp-specialized TMA to wgmma pipelineA depth-3 buffer feeding wgmma. The producer streams 6 TMA loads while the consumer runs the matmul on already-staged tiles. Warp-specialized finishes in 20 cycles versus 30 for load-then-compute, about 1.5 times faster, and the tensor cores stay busy from cycle 2 on.producerTMA loadconsumerwgmmashared-memory buffer · 3 stagestile 0
Priming the buffer. Producer fills, consumer waits for the first tile.
Quick checkwgmma reads its operands straight from shared memory and runs asynchronously. Why does that matter for the pipeline?

Same GEMM, new machine. The bytes still flow global → shared → tensor core, but Hopper moves the address math into hardware and lets warps specialize. Blackwell goes one step further and moves the accumulator out of the registers too. That is next.

Blackwell: make the numbers smaller

Every phase so far has moved bytes faster. Blackwell also makes each number smaller. FP8 and FP4 are ordinary floating-point numbers squeezed into 8 or 4 bits instead of the usual 16. A matmul in FP8 moves half the bytes of FP16 and runs the tensor cores about twice as fast. FP4 halves it again.

The catch is accuracy: four bits can barely tell numbers apart. The fix is microscaling. Store a block of values in FP4, but share one small scale factor (an 8-bit number, FP8 in NVFP4) across every 16 of them, plus one full-precision scale for the whole tensor. NVIDIA's NVFP4 does exactly this.

Pick a format. Watch how many values fit in one 32-byte chunk and how the tensor-core peak grows as the numbers shrink. For FP4, the amber ticks mark each shared scale.

FP16 is 16 bits per value, so 16 values pack into one 32-byte chunk and the tensor core runs about 1x FP16 peak.

Show as data
Format width, packing, and relative peak
FormatBitsValues / 32 BPeak vs FP16
FP161616
FP8832
FP66422.6666666666666665×
FP4464

Move the accumulator out of the registers

Here is the surprise. Feed the tensor cores at FP4 rates and the bottleneck stops being memory. It becomes the register file itself. The accumulator, the running sum D += A × B, cannot be read and written fast enough from registers.

So Blackwell gives it a dedicated home: Tensor Memory, or TMEM. That is 256 KB per SM, with its own bandwidth. The accumulator lives there for the whole matmul instead of clogging the registers.

And the instruction that drives it, tcgen05.mma (an MMA, matrix multiply-accumulate, the tensor core's core operation D = A × B + C), is issued by a single thread on behalf of the entire block. On Hopper 128 threads cooperated on a wgmma; on Blackwell one thread fires the MMA and the tensor core reads straight from shared memory and TMEM. A pair of SMs can even share one MMA, each staging half the operands.

Step the precision down and watch the accumulator's bandwidth demand climb. Past FP8 it crosses the register-file limit, and the accumulator hops to TMEM.

At FP16, the accumulator needs about 1x the bandwidth it did at FP16. The register file can still supply that, so the accumulator stays in registers.

Schematic: the ceiling is illustrative, not a measured bandwidth.

Quick checkWhy did Blackwell add a separate Tensor Memory for the accumulator instead of just using more registers?

That closes the hardware tour: coalescing, the scheduler, tiling, occupancy, TMA, and now tensor memory. One last idea ties the thread-to-data mapping together, the algebra the kernels are actually written in. That is the capstone: CuTe layouts.

Capstone: the layout is the whole trick

Look back at the journey. Coalescing was about which address each thread reads. Tiling was about where a tile sits in shared memory. Swizzling and padding reshuffled that placement to dodge bank conflicts. TMA carried a descriptor of it. Every one of those is the same thing: a mapping from a logical coordinate to a memory offset.

CuTe, the foundation of CUTLASS (NVIDIA's library for fast matmul kernels), writes that mapping as one object: Layout = Shape ⊗ Stride. The shape says how big each axis is. The stride says how far apart its elements sit. Feed it a coordinate (i, j) and it returns an offset, i·strideᵢ + j·strideⱼ. That is the entire idea.

The power is that shape and stride are independent. Keep the same logical grid and change only the strides, and the data is suddenly row-major, column-major, or padded, without a single element moving.

Flip the layout between row-major and column-major and watch every offset renumber. Then choose whether a warp reads a row or a column, and the coalescing verdict from Phase 0 drops straight out of the stride.

Layout
A warp reads a

Layout = Shape Stride = (4, 8) (8, 1) · offset = i·8 + j·1

✓ Consecutive threads are 1 apart: offsets 0, 1, 2, 3, 4, 5, 6, 7. One coalesced transaction, the fast path from Phase 0.

Show as data
Thread order → logical coord → memory offset for the selected row
ThreadCoordOffset
0(0, 0)0
1(0, 1)1
2(0, 2)2
3(0, 3)3
4(0, 4)4
5(0, 5)5
6(0, 6)6
7(0, 7)7
Quick checkYour data is stored row-major, but your warp needs to read down a column. What happens, and what is the classic fix?

This is why real kernels are written in CuTe. Tensor cores demand exact, swizzled thread-to-data mappings for TMA and wgmma, and hand-deriving them is a footgun. The layout algebra composes and tiles them at compile time, portably across Ampere, Hopper, and Blackwell.

You now have the whole machine and the language its kernels speak. The next track flips from the hardware to the software: how production kernels like FlashAttention, DeepGEMM, and the MoE libraries put every one of these ideas to work. It starts with the one picture that governs all of them, the roofline.

The payoff: one picture for every kernel

You have the machine. Now the question every real kernel asks: am I limited by the math or by the memory? The roofline answers it with a single plot.

The x-axis is arithmetic intensity, how many FLOPs (floating-point operations, the multiplies and adds) you do per byte you move. The roof has two parts. A sloped bandwidth roof on the left: if you touch memory a lot per flop, bandwidth caps you. A flat compute roof on the right: once you reuse enough, the tensor cores cap you. The corner where they meet is the ridge.

Every optimization in Loom was a fight to move rightward. Coalescing and fusion cut the bytes. Tiling multiplied the reuse. Watch it happen: naive matmul starves down on the diagonal, and tiling lifts the very same operation onto the flat roof.

Drag the intensity slider, or jump to a named op. Left of the ridge is memory-bound and slow. The goal of nearly every kernel trick is to climb the slope and reach the flat.

Roofline: attainable throughput versus arithmetic intensityAt 6.0 FLOP per byte the kernel is memory-bound, reaching about 9 of 312 TFLOP/s, 3% of peak. The ridge is at 201 FLOP per byte; below it, bandwidth is the limit.0.251101001000110100arithmetic intensity (FLOP / byte)TFLOP/sridge ≈ 201compute roof · 312 TFLOP/selementwise addsoftmaxnaive matmulFlashAttentiontiled matmul

Memory-bound at 6.0 FLOP/byte, only 3% of peak. Bandwidth caps you. Reuse more data per byte loaded to climb toward the ridge.

Quick checkA fused elementwise kernel (add then GELU) runs at 4% of peak FLOPs. Is it broken?

Where this track goes next

The rest of the payoff is the library side: how production kernels put every idea from Loom to work. It is now its own page, real kernels, with a steppable FlashAttention that reuses the scenes you just played with.

The kernel-authoring ladder, from least effort to most control:

  1. PyTorch / torch.compile. Write plain tensor code; the compiler fuses it and generates Triton.
  2. Triton. A Python domain-specific language (DSL) where you write block-level programs, and the compiler handles the thread mapping, coalescing, and pipelining.
  3. CUTLASS / CuTe. C++ templates for peak tensor-core GEMM, driven by the layout algebra from Phase 6.
  4. CUDA / PTX (the low-level instruction layer). The baseline. You place threads and do the address math by hand, exactly the mental model Phases 0 to 2 built.

And the kernels themselves, as a family tree: attention (FlashAttention, then the SDPA dispatcher, FlexAttention, and the quantized SageAttention line) and mixture of experts (grouped GEMM, ScatterMoE and SonicMoE, then expert parallelism and DeepEP across many GPUs). You do not need to know these names yet. Every one is just coalescing, tiling, fusion, and tensor cores, aimed at a real model.

That is Loom. A single piece of data, followed from slow memory to the tensor core and back, with every trick that keeps the machine busy along the way. The one idea underneath all of it has not changed since Phase 0: move the bytes you need, and no more.

How we got here

Loom taught the machine as it stands today, but it did not arrive all at once. Each generation added one big idea to the same story, and every phase you played with was somebody's headline feature a few years ago.

The clearest thread is the tensor core, the unit that multiplies small matrices in one instruction. It did not exist before Volta in 2017. Since then its menu of number formats has only grown: FP16, then INT8, then BF16, FP8, and now FP4. That growing menu is the whole reason the next page exists.

Click along the timeline. Watch the precision chips accumulate as you move right, and follow each generation back to the phase where you met its trick.

Blackwell (2024, B200): Four-bit numbers and a new home for the accumulator. Tensor-core precisions available by this point: FP16, INT8, BF16 / TF32, FP8, FP4 / FP6.

Show as data
NVIDIA datacenter GPU generations and the tensor-core precision each added
GenerationYearChipNew precision
Pascal2016P100
Volta2017V100FP16
Turing2018T4INT8
Ampere2020A100BF16 / TF32
Hopper2022H100FP8
Blackwell2024B200FP4 / FP6

Two of those moves opened up whole topics of their own. The narrowing numbers became a zoo of formats: FP8, FP4, NVFP4, MXFP4, and the GGUF quants that run models on your laptop. And the matrix unit became the engine behind real kernels like FlashAttention. Each has its own page.

Why "Loom"?

Because NVIDIA got there first. A loom clamps a few hundred threads under tension, bundles them into warps, and throws data across them to weave a pattern. Squint a little and that is a GPU.

The vocabulary is lifted wholesale: threads, warps, and the tiles they weave into. Even CuTe plays along. A layout is just a weave pattern with a spec sheet.

Thirty-two threads, one shuttle, no dropped stitches.

Thread
The smallest unit of work on a GPU. One thread runs one instance of the kernel on one lane.
Warp
A group of 32 threads that execute the same instruction together, in lockstep. The scheduling unit of the GPU.
SIMT
Single Instruction, Multiple Threads. All 32 threads of a warp run one shared instruction over their own data.
Lockstep
All threads in a warp advance together on the same instruction at the same time.
Coalescing
When a warp reads neighbouring addresses so the hardware serves them in as few memory transactions as possible.
Sector
The 32-byte unit the hardware fetches from global memory. A warp wants its data packed into as few sectors as possible.
HBM
High Bandwidth Memory, the large off-chip global memory. Biggest and slowest tier, hundreds of cycles away.
Shared memory
Fast on-chip scratchpad private to one thread block. Split into 32 banks.
Register
Per-thread on-chip storage. The fastest memory, about one cycle to access.
L2 cache
On-chip cache shared by all SMs, sitting between the per-SM L1 caches and global HBM.
SM
Streaming Multiprocessor. A core building block of the GPU that runs thread blocks. An A100 has 108 of them.
Sub-partition
One of the four processing blocks inside an SM. Each has its own warp scheduler and execution units.
Warp scheduler
The unit that picks one eligible warp each cycle and issues its next instruction. Four per SM.
Eligible
A warp that is ready to issue this cycle, not waiting on memory or a dependency.
Stalled
A warp that cannot issue yet because it is waiting, usually on a memory load.
Latency hiding
Keeping the machine busy during long waits by running other ready warps while one warp stalls.
Occupancy
The number of active warps on an SM divided by the maximum it can hold. More warps give the scheduler more to switch to.
Bank
One of the 32 slots shared memory is split into. Consecutive 4-byte words map to consecutive banks (word w lands in bank w mod 32), and each bank serves one word per cycle.
Bank conflict
When two or more threads in a warp want different words in the same bank. Their reads serialize.
Tiling
Loading a small block of a matrix into shared memory once so every thread in the block reuses it many times.
Data reuse
Using a value staged in fast memory many times before fetching new data, so slow global memory is touched as little as possible.
GEMM
General matrix multiply, C = A times B. The workhorse operation behind neural networks and the main thing GPUs are tuned for.
cp.async
An asynchronous copy from global memory straight into shared memory, without stalling the thread or passing through registers.
Software pipelining
Overlapping the load of the next tile with the compute on the current one, so memory latency hides behind useful work.
Double buffering
Using two shared-memory buffers that take turns: one is being computed on while the other is being filled by the next load.
Prefetch
Starting a load early, before the data is needed, so it has arrived by the time you use it.
Thread block
A group of threads that run on one SM and share its shared memory. Also called a CTA.
Register file
The pool of registers on an SM, shared out among all resident threads. About 256 KB on an A100.
Register pressure
How many registers a kernel needs per thread. High pressure means fewer warps fit, which lowers occupancy.
Register spilling
When a thread needs more registers than it has, the extra values spill to local memory, which actually lives in slow global memory.
Local memory
Per-thread memory that, despite the name, lives in slow off-chip global memory. Registers spill here when they run out.
TMA
Tensor Memory Accelerator. A Hopper copy engine that moves whole tensor tiles between global and shared memory from a single descriptor, so one thread issues the load.
Tensor map descriptor
A small host-built struct (128 bytes) that tells TMA the tensor base, shape, strides, tile size, element type, and swizzle. One thread passes it to issue a bulk copy.
mbarrier
An asynchronous barrier in shared memory. TMA signals it when a tile lands and waiting threads wake, handing each buffer stage from producer to consumer.
wgmma
Warpgroup matrix multiply. A Hopper instruction where 128 threads issue one asynchronous tensor-core matmul that reads its operands from shared memory.
Warpgroup
Four contiguous warps, 128 threads, the granularity Hopper wgmma operates on.
Warp specialization
Giving different warps different jobs: producer warps issue TMA loads while consumer warps run wgmma, overlapping load and compute.
Thread block cluster
A Hopper group of blocks co-scheduled on one GPC that can read each other’s shared memory (distributed shared memory).
Multicast
A TMA mode that broadcasts one global load into several blocks’ shared memory in a cluster, so a shared operand crosses the bus only once.
Tensor Memory
A dedicated on-SM memory on Blackwell (256 KB) that holds the MMA accumulator, so the register file no longer has to feed the tensor cores at FP4 rates.
tcgen05
Blackwell’s 5th-generation tensor-core MMA. A single thread issues the matmul for the whole block, reading operands from shared memory and TMEM.
Microscaling
Storing a block of low-precision values (say FP4) with one shared low-precision block scale (E8M0 for MXFP4, FP8 for NVFP4), so tiny formats stay accurate. NVFP4 uses a block of 16.
FP4
A 4-bit floating-point format (E2M1). Packs eight values per 32 bits and roughly quadruples tensor-core peak over FP16, at the cost of range.
Accumulator
The running sum D in D = A × B that a matmul builds up across its K steps. Where it lives (registers or TMEM) is a recurring bottleneck.
CuTe
The layout layer under CUTLASS. Expresses thread-to-data mappings as Shape ⊗ Stride and composes, tiles, and swizzles them at compile time.
CUTLASS
NVIDIA’s open template library for peak-performance GEMM and related kernels, built on CuTe layouts.
Layout
A CuTe object, Shape ⊗ Stride, that maps a logical coordinate to a linear memory offset. Change the stride to re-lay-out data without moving it.
Stride
How far apart, in memory, consecutive elements along an axis sit. A stride of 1 means contiguous, which is what makes a read coalesce.
Swizzle
A layout that permutes shared-memory addresses so a tile reads back bank-conflict-free and in the order the tensor cores want.
Roofline
A plot of attainable throughput against arithmetic intensity. An op is memory-bound under a bandwidth roof until enough reuse lifts it to the flat compute roof.
Arithmetic intensity
FLOPs performed per byte moved from memory. Low intensity is memory-bound; tiling raises it until the op becomes compute-bound.
Exponent
A float's exponent bits set how big or small it can get. More exponent means more dynamic range.
Mantissa
A float's mantissa bits set how finely it resolves values between powers of two. More mantissa means more precision.
BF16
Brain float 16: 8 exponent bits and 7 mantissa bits. Same range as FP32 with less precision, which is why it is the training default.
FP8
An 8-bit float in two flavors: E4M3 (more precision, forward pass) and E5M2 (more range, gradients). Introduced for tensor cores on Hopper.
Quantization
Storing weights or activations in fewer bits than they were trained in, usually with a shared scale factor to recover the real magnitudes.
INT8
An 8-bit integer format with evenly spaced steps and a shared scale. Cheap and tight when a tensor has no wild outliers.
Ternary (BitNet b1.58)
Weights restricted to three values, -1, 0, and +1, about 1.58 bits each. The matmul becomes addition and subtraction with no multiplies; trained from scratch, not compressed after.
NVFP4
NVIDIA's 4-bit float (E2M1) with one FP8 scale per 16 values plus a per-tensor FP32 scale. Finer blocks than MXFP4 for better accuracy.
MXFP4
The open OCP microscaling 4-bit float: one power-of-two scale (E8M0) shared across every block of 32 values.
GGUF
The llama.cpp file format that packs a model plus metadata for local inference. Holds k-quant tensor types like Q4_K_M.
k-quant
A GGUF quantization scheme: weights in super-blocks of 256, split into sub-blocks of 32, with a two-level (super-block and sub-block) scale.
Attention
The operation that lets each token weigh every other token: score queries against keys, softmax the scores, then blend the values.
Softmax
Turns a row of numbers into positive weights that sum to 1, so they can act as attention weights or class probabilities.
Token
One chunk of a model’s input or output, roughly a word or word-piece. Sequences are measured in tokens.
Gradient
The correction signal used to update a model during training. Gradients span a huge range of magnitudes, which is why their number format needs range.
Tensor core
A dedicated unit inside the SM that multiplies a small matrix in one instruction, far faster than ordinary threads doing it multiply by multiply. First shipped on Volta.
MMA
Matrix multiply-accumulate: the tensor core’s core operation, D = A × B + C. wgmma (Hopper) and tcgen05.mma (Blackwell) are MMA instructions.
FLOP
One floating-point operation, a single multiply or add. Throughput is measured in FLOPs per second (FLOP/s).
Activation
The data flowing through a model (the inputs and intermediate results), as opposed to the fixed weights. Activations carry outliers, which makes them harder to quantize.
W4A16
A quantization recipe: 4-bit weights, 16-bit activations. Weight-only, so it saves memory but computes in 16-bit. GPTQ, AWQ, and GGUF are W4A16.
W4A4
A quantization recipe: 4-bit weights and 4-bit activations. The matmul runs on low-precision tensor cores, saving compute too, but activation outliers make it hard. NVFP4 is W4A4.
SDPA
PyTorch's scaled_dot_product_attention: a dispatcher that auto-picks a fused FlashAttention-style backend (flash, memory-efficient, or cuDNN) for your shapes and dtype.
FlexAttention
A PyTorch API for writing custom attention masks and score modifications as a small function that still compiles to one fused FlashAttention-style kernel.
SageAttention
Quantized attention: runs the attention matmuls in INT8 or FP4 on tensor cores, smoothing outlier channels first. The W4A4 idea applied to attention.
SonicMoE
A mixture-of-experts kernel library that packs routed tokens into contiguous groups so each expert’s grouped-GEMM tile is full and efficient.
Data parallel
Replicate the whole model on every GPU, split the batch, and average gradients with an all-reduce. DDP is the efficient PyTorch version.
FSDP / ZeRO
Sharded data parallel: split the batch AND shard params, gradients, and optimizer state across GPUs, gathering each layer just in time. Trades communication for memory.
Tensor parallel
Split each layer's weight matrices across GPUs (Megatron splits heads and MLP columns) and all-reduce the activations. Chatty, so it wants NVLink.
Pipeline parallel
Put different layers on different GPUs as stages; activations flow stage to stage. The idle time while the pipeline fills and drains is the "bubble."
Expert parallel
Spread a mixture-of-experts layer’s experts across GPUs and all-to-all the tokens to wherever their expert lives. DeepEP accelerates the shuffle.
Context / sequence parallel
Split the sequence across GPUs. Sequence parallel saves activation memory on the non-matmul parts; context parallel (ring attention) splits attention over ultra-long contexts.
All-reduce
A collective that sums a value across all GPUs and hands every GPU the total. Used to average gradients (data parallel) and combine activations (tensor parallel).
All-to-all
A collective where every GPU sends a different piece to every other GPU. Used to route tokens to their expert in expert parallelism.
KV cache
The stored Keys and Values of every past token, kept so they are not recomputed each step. It dominates long-context memory and grows with the number of KV heads, head size, layers, and sequence length.
RoPE
Rotary position embedding: encodes position by rotating the query and key vectors by an angle proportional to position, so the attention score depends on relative distance. Applied every layer, not added to embeddings.
NoPE
No positional encoding: a decoder-only causal model can infer position from the causal mask alone (a counting signal), with no explicit position input. Works only in the causal setting and has a finite usable range.
Sliding window attention
Each token attends only to the last W tokens (a local band), capping cost and local KV cache. Stacking layers still compounds the reach, so the model is not limited to W.
Attention sink
A learned bias that keeps the always-important first tokens in the softmax denominator, letting a very small sliding window (like gpt-oss’s 128) stay stable.
MHA
Multi-head attention: every query head has its own Key and Value head. Best quality, largest KV cache.
MQA
Multi-query attention: all query heads share a single Key/Value head. Smallest KV cache, but the hard sharing can cost quality.
GQA
Grouped-query attention: query heads are split into groups, each sharing one Key/Value head. Near-MHA quality at close to MQA memory, and the mainstream default.
MLA
Multi-head latent attention (DeepSeek): compress every head's K and V into one small shared latent, cache only that, and reconstruct per-head K/V on the fly. GQA-level cache at MHA quality, with a small decoupled RoPE key.
DSA
DeepSeek Sparse Attention (V3.2): a lightning indexer scores past tokens and each query attends only to its top-k (2048), turning attention from order N squared into order N times k. Built on MLA.
YaRN
A RoPE context-extension method: scale the rotation frequencies (NTK-style) and add an attention-temperature correction, so a model trained at one length works at a longer one.