Loom

Follow the data through the GPU. A warp is 32 threads run in lockstep. The word comes straight from weaving, where the warp is the set of threads held in parallel on a loom.

Loom follows a single piece of data on its trip through an NVIDIA GPU, from the slow memory far away to the math units that crunch it, and shows you why the fast code is fast.

No CUDA required. Every idea here is a picture you can play with. Drag a slider, press play, predict what happens, then reveal it. Small quizzes and mnemonics help it stick.

What's inside

Phase 0 Coalescing. Watch a warp load from global memory. See why reading neighbouring addresses is fast and scattering is slow.
Phase 1 The SM. Meet the warp scheduler that hides memory latency, then beat a shared-memory bank conflict.
Phase 2 Tiled matmul. Put it together: tile for reuse, pipeline with cp.async, and pad away the last conflict.
Phase 3 Occupancy. How many warps actually fit, and why registers and shared memory pull against each other.
Phase 4-5 Hopper & Blackwell. Re-choreograph the same matmul: one thread fires a TMA load, warps specialize, and the numbers shrink to FP4.
Phase 6 CuTe layouts. The capstone: the Shape × Stride algebra that every kernel is really written in.
Advanced Real kernels. The roofline, and how FlashAttention and MoE put it all to work.

Two side trips branch off the main path once you have the basics: the number zoo (BF16 to FP4, NVFP4, and the GGUF quants on your laptop) and real kernels (FlashAttention, DeepGEMM, MoE up close).

By the end you will know the vocabulary the rest of the field assumes, warp, coalescing, occupancy, tiling, and the one idea underneath all of it: move the bytes you need, and no more.

None of the unfamiliar words above need to mean anything yet. Each is unpacked from scratch when you reach it. Start at the top and scroll. Each scene builds on the one before it.

Memory coalescing, one warp at a time

Every GPU performance story starts here.

A warp is a group of 32 threads that run together in lockstep. Each of the 32 loads a word (4 bytes) from global memory. The hardware does not fetch 32 little pieces. It fetches aligned 32-byte sectors and hands each thread its slice.

Read neighbouring words and a whole sector is useful. That is a coalesced load. Stride apart and each sector arrives carrying mostly data nobody asked for.

Start small. Set the warp to 1 thread, press play, then step up to 4 and then 32. Then flip to strided and predict the damage before you reveal it.

coalesced access, 32 of 32 lanes active: 4 sectors fetched vs 4 ideal, 100% bus utilization.

Lanes (kiddy pool → full warp)

Access pattern

Timeline

Show this scene as data (screen-reader & LLM friendly)

Pattern: coalesced. Active lanes: 32. Sectors fetched: 4 (ideal 4). Useful bytes 128 of 128 fetched, giving 100% bus utilization.

Per-lane memory access for the current pattern
Lane	Word	Byte address	Sector
0	0	0–3	0
1	1	4–7	0
2	2	8–11	0
3	3	12–15	0
4	4	16–19	0
5	5	20–23	0
6	6	24–27	0
7	7	28–31	0
8	8	32–35	1
9	9	36–39	1
10	10	40–43	1
11	11	44–47	1
12	12	48–51	1
13	13	52–55	1
14	14	56–59	1
15	15	60–63	1
16	16	64–67	2
17	17	68–71	2
18	18	72–75	2
19	19	76–79	2
20	20	80–83	2
21	21	84–87	2
22	22	88–91	2
23	23	92–95	2
24	24	96–99	3
25	25	100–103	3
26	26	104–107	3
27	27	108–111	3
28	28	112–115	3
29	29	116–119	3
30	30	120–123	3
31	31	124–127	3

Why the strided case hurts

A coalesced 32-lane load needs exactly four sectors. 128 bytes in, 128 bytes used, 100% of the bus doing real work.

The strided load touches a fresh sector for every lane. That is 32 sectors, 1024 bytes moved to deliver the same 128 useful bytes.

Same data, eight times the traffic. Eight times the wait on the slowest memory in the machine.

That one idea, move the bytes you need and no more, is the seed of tiling, cp.async, TMA, and everything Loom builds next.

So who does the waiting?

A global load takes hundreds of cycles.

If a thread just sat and waited, the machine would idle most of the time. It does not.

On an A100 or H100, an SM keeps up to 64 warps resident at once. Its four warp schedulers each issue one instruction per cycle from an eligible warp, one that is not currently waiting on memory.

While one warp is stalled on memory, the scheduler selects another. That is latency hiding, and it is why GPUs want lots of threads.

Switching costs nothing. Every warp's registers stay live in hardware.

Drag the warp count up from 1 and watch the idle time vanish. Amber cells are warps stalled on memory. Teal is the one the scheduler selected this cycle. With too few warps it hits bubbles, cycles where nobody is eligible.

Resident warps: 2 / 16 per scheduler

31% issued

One scheduler shown (an A100/H100 SM has 4, up to 16 warps each). Cycle counts are compressed for visibility. A real global load is ~400–800 cycles, which is exactly why an SM keeps up to 64 warps ready.

2 resident warps on one scheduler: it selected an eligible warp on 31% of cycles. In this compressed toy timeline about 6 warps hide the memory latency; a real global load is hundreds of cycles, which is why an SM holds up to 16 warps per scheduler (64 per SM).

Your turn: kill the bank conflict

Shared memory is fast, but it is split into 32 banks, and a warp can read only one word per bank per cycle. Consecutive 4-byte words are spread across the banks in order, so word w sits in bank w mod 32, wrapping back to bank 0 every 32 words. Two threads that want different words in the same bank are a conflict, and the reads serialize.

Hit 32 different banks and the whole warp reads at once. Collide, and the read serializes into one transaction per pile-up.

Below, 32 threads read smem[thread × stride]. Find the stride that keeps all 32 banks distinct.

Puzzle. 32 threads read shared memory as smem[thread × stride]. Pick the stride that sends all 32 threads to 32 different banks, so the whole warp reads in a single cycle.

16 of 32 banks used · 2-way

Stride 2: 2 threads collide on each used bank. That is a 2-way conflict, so this read serializes into 2 transactions. Even strides share a factor with 32, so threads pile up.

Show as data

Thread → word → bank at stride 2
Thread	Word	Bank
0	0	0
1	2	2
2	4	4
3	6	6
4	8	8
5	10	10
6	12	12
7	14	14
8	16	16
9	18	18
10	20	20
11	22	22
12	24	24
13	26	26
14	28	28
15	30	30
16	32	0
17	34	2
18	36	4
19	38	6
20	40	8
21	42	10
22	44	12
23	46	14
24	48	16
25	50	18
26	52	20
27	54	22
28	56	24
29	58	26
30	60	28
31	62	30

Follow the data: a tiled matmul

Now we put the pieces together on the job GPUs are built for: matrix multiply.

Each output number is a row times a column. Done naively, every thread streams a whole row and column from global memory, and neighbours re-read the exact same values.

That is the trap. The math is cheap, but the machine spends all its time waiting on memory. Matmul is memory-bound until you fix the reuse.

The fix is tiling. Load a small tile of A and a tile of B into shared memory once, then let every thread in the block (the group of warps that runs together on one SM and shares its fast memory) reuse them many times before moving on.

Click any output tile in C. It is a running sum. Sweep K and watch it pull one tile of A and one tile of B per step, add them in, and fill up.

Output tile row 1, column 1: step 1 of 4, staging A tile (row 1, col 0) and B tile (row 0, col 1). 25% accumulated.

Each number staged in shared memory is reused 32×. That turns 65,536 global reads into 2,048.

Don't wait for the load. Prefetch it.

Tiling cut the traffic. But each tile still has to arrive before you can use it.

The naive loop stalls on every load: fetch a tile, sit idle, multiply, repeat. The slow memory shows up as dead time between every step.

cp.async changes that. It copies a tile from global straight into shared memory asynchronously, without parking the thread and without going through registers. An ordinary load routes each word global → register → shared, tying up the thread and a register the whole time; cp.async skips straight to shared, so the thread can keep working.

So while you multiply the current tile, you prefetch the next one. Two buffers, taking turns. The load hides behind the compute.

Run both schedules on one clock. Same work, same tiles. Watch the pipelined run cross the finish line while the synchronous one is still stalling.

Same work, two schedules. Watch the pipeline finish first.

Boss: the tile fights back

The tile is staged in shared memory. Now the threads read a column of it to do the multiply.

Remember the 32 banks from Phase 1? Stepping down a column jumps a full row, 32 elements, each time. With 32 banks, that stride sends every thread in a column to the same bank. That is a 32-way conflict, the worst case.

Fix it with the oldest trick in the book: a little padding.

Boss puzzle. Your staged tile is 32 columns wide. 32 threads read one column, smem[thread × width]. Add the smallest padding that gives all 32 threads their own bank.

width 32 · 1 of 32 banks · 32-way

Width 32: 32 threads pile onto each used bank, a 32-way conflict. The read serializes into 32 transactions. An even width shares a factor with 32, so threads collapse onto the same banks.

Show as data

Thread → word → bank reading a column at width 32
Thread	Word	Bank
0	0	0
1	32	0
2	64	0
3	96	0
4	128	0
5	160	0
6	192	0
7	224	0
8	256	0
9	288	0
10	320	0
11	352	0
12	384	0
13	416	0
14	448	0
15	480	0
16	512	0
17	544	0
18	576	0
19	608	0
20	640	0
21	672	0
22	704	0
23	736	0
24	768	0
25	800	0
26	832	0
27	864	0
28	896	0
29	928	0
30	960	0
31	992	0

That is the Ampere playbook: tile for reuse, pipeline with cp.async, and pad to keep shared memory conflict-free. Hopper and Blackwell rebuild this same dataflow with new hardware. That is where Loom goes next.

How many warps actually fit?

Phase 1 sold you on warps: the more you have, the better the scheduler hides latency.

So why not run the maximum every time? Because warps are not free.

Each SM has a fixed budget: one register file and one slab of shared memory. Every thread's registers come out of that file. Every block's shared memory comes out of that slab.

Ask for more registers per thread, and fewer warps fit. Ask for more shared memory per block, and fewer blocks fit. Occupancy is the fraction of the maximum warps you manage to keep resident.

Drag the sliders. Push registers per thread up and watch the warps vanish. The meter that turns amber is the resource that ran out first.

64 / 64 warps 100% occupancy

Shared memory 78%

Registers / thread 32 Shared memory / block 16 KB Threads / block 256

✓ 64 warps resident. Plenty for the scheduler to hide latency. Limited by registers.

This is the hinge between the last two phases. Latency hiding (Phase 1) wants many warps. Tiling (Phase 2) spends shared memory to get reuse. Both pull on the same fixed budget, and occupancy is where you balance them.

Hopper: let one thread do the load

Phase 2's Ampere loop worked, but look at who was doing the loading. Every one of the block's threads computed its own addresses and fired a cp.async. That address arithmetic burns registers, the very budget Phase 3 showed is scarce.

Hopper adds a dedicated copy engine, the TMA (Tensor Memory Accelerator). Now a single thread hands it a descriptor, a small struct that says where the tensor lives, its shape and strides, the tile to grab, and how to swizzle it (permute the shared-memory addresses so the load lands conflict-free). The engine does all the address math.

When the tile lands, TMA flips an mbarrier, a hardware flag threads can wait on, to wake the waiting threads. The other 127 never touched an address. They are free to compute.

Flip between Ampere and Hopper. Watch the block go from every thread grinding on address math to one thread firing a descriptor and the rest set free.

The thread block · 128 threads

all 128 on address math

128 threads each compute addresses → cp.async → shared memory

~32 registers/thread spent on address math

Ampere: all 128 threads compute their own addresses and fire cp.async, spending roughly 32 registers each on address arithmetic.

Split the warps: producer and consumer

With TMA doing the copy, you can give whole warps different jobs. This is warp specialization. One group of warps, the producer, does nothing but fire TMA loads into a multi-stage shared-memory buffer. Another group, the consumer, does nothing but run wgmma, the warpgroup matrix multiply, draining that buffer. A warpgroup is four warps, 128 threads, that together issue one instruction to the tensor cores: the dedicated units inside the SM that multiply a small matrix in a single step, far faster than threads doing it one multiply at a time. Every phase from here runs on them.

An mbarrier per stage hands each tile from producer to consumer. Because the buffer is a few stages deep, the producer can run several tiles ahead. Once it is primed, the tensor cores never wait.

Run the pipeline. The amber lane is TMA loading tiles; the teal lane is wgmma multiplying them. After the buffer fills, the teal lane runs solid to the end. That is a saturated tensor core.

Priming the buffer. Producer fills, consumer waits for the first tile.

Same GEMM, new machine. The bytes still flow global → shared → tensor core, but Hopper moves the address math into hardware and lets warps specialize. Blackwell goes one step further and moves the accumulator out of the registers too. That is next.

Blackwell: make the numbers smaller

Every phase so far has moved bytes faster. Blackwell also makes each number smaller. FP8 and FP4 are ordinary floating-point numbers squeezed into 8 or 4 bits instead of the usual 16. A matmul in FP8 moves half the bytes of FP16 and runs the tensor cores about twice as fast. FP4 halves it again.

The catch is accuracy: four bits can barely tell numbers apart. The fix is microscaling. Store a block of values in FP4, but share one small scale factor (an 8-bit number, FP8 in NVFP4) across every 16 of them, plus one full-precision scale for the whole tensor. NVIDIA's NVFP4 does exactly this.

Pick a format. Watch how many values fit in one 32-byte chunk and how the tensor-core peak grows as the numbers shrink. For FP4, the amber ticks mark each shared scale.

One 32-byte chunk holds 16 FP16 values

Tensor-core peak vs FP16

1×

half precision, the baseline

FP16 is 16 bits per value, so 16 values pack into one 32-byte chunk and the tensor core runs about 1x FP16 peak.

Show as data

Format width, packing, and relative peak
Format	Bits	Values / 32 B	Peak vs FP16
FP16	16	16	1×
FP8	8	32	2×
FP6	6	42	2.6666666666666665×
FP4	4	64	4×

Move the accumulator out of the registers

Here is the surprise. Feed the tensor cores at FP4 rates and the bottleneck stops being memory. It becomes the register file itself. The accumulator, the running sum D += A × B, cannot be read and written fast enough from registers.

So Blackwell gives it a dedicated home: Tensor Memory, or TMEM. That is 256 KB per SM, with its own bandwidth. The accumulator lives there for the whole matmul instead of clogging the registers.

And the instruction that drives it, tcgen05.mma (an MMA, matrix multiply-accumulate, the tensor core's core operation D = A × B + C), is issued by a single thread on behalf of the entire block. On Hopper 128 threads cooperated on a wgmma; on Blackwell one thread fires the MMA and the tensor core reads straight from shared memory and TMEM. A pair of SMs can even share one MMA, each staging half the operands.

Step the precision down and watch the accumulator's bandwidth demand climb. Past FP8 it crosses the register-file limit, and the accumulator hops to TMEM.

Accumulator bandwidth demand (× FP16)

1× the FP16 rate

At FP16, the accumulator needs about 1x the bandwidth it did at FP16. The register file can still supply that, so the accumulator stays in registers.

Schematic: the ceiling is illustrative, not a measured bandwidth.

That closes the hardware tour: coalescing, the scheduler, tiling, occupancy, TMA, and now tensor memory. One last idea ties the thread-to-data mapping together, the algebra the kernels are actually written in. That is the capstone: CuTe layouts.

Capstone: the layout is the whole trick

Look back at the journey. Coalescing was about which address each thread reads. Tiling was about where a tile sits in shared memory. Swizzling and padding reshuffled that placement to dodge bank conflicts. TMA carried a descriptor of it. Every one of those is the same thing: a mapping from a logical coordinate to a memory offset.

CuTe, the foundation of CUTLASS (NVIDIA's library for fast matmul kernels), writes that mapping as one object: Layout = Shape ⊗ Stride. The shape says how big each axis is. The stride says how far apart its elements sit. Feed it a coordinate (i, j) and it returns an offset, i·strideᵢ + j·strideⱼ. That is the entire idea.

The power is that shape and stride are independent. Keep the same logical grid and change only the strides, and the data is suddenly row-major, column-major, or padded, without a single element moving.

Flip the layout between row-major and column-major and watch every offset renumber. Then choose whether a warp reads a row or a column, and the coalescing verdict from Phase 0 drops straight out of the stride.

Layout

A warp reads a

M 4 N 8

Layout = Shape ⊗ Stride = (4, 8) ⊗ (8, 1) · offset = i·8 + j·1

Cells are shaded by memory offset. Click any cell to move the read line.

✓ Consecutive threads are 1 apart: offsets 0, 1, 2, 3, 4, 5, 6, 7. One coalesced transaction, the fast path from Phase 0.

Show as data

Thread order → logical coord → memory offset for the selected row
Thread	Coord	Offset
0	(0, 0)	0
1	(0, 1)	1
2	(0, 2)	2
3	(0, 3)	3
4	(0, 4)	4
5	(0, 5)	5
6	(0, 6)	6
7	(0, 7)	7

This is why real kernels are written in CuTe. Tensor cores demand exact, swizzled thread-to-data mappings for TMA and wgmma, and hand-deriving them is a footgun. The layout algebra composes and tiles them at compile time, portably across Ampere, Hopper, and Blackwell.

You now have the whole machine and the language its kernels speak. The next track flips from the hardware to the software: how production kernels like FlashAttention, DeepGEMM, and the MoE libraries put every one of these ideas to work. It starts with the one picture that governs all of them, the roofline.

The payoff: one picture for every kernel

You have the machine. Now the question every real kernel asks: am I limited by the math or by the memory? The roofline answers it with a single plot.

The x-axis is arithmetic intensity, how many FLOPs (floating-point operations, the multiplies and adds) you do per byte you move. The roof has two parts. A sloped bandwidth roof on the left: if you touch memory a lot per flop, bandwidth caps you. A flat compute roof on the right: once you reuse enough, the tensor cores cap you. The corner where they meet is the ridge.

Every optimization in Loom was a fight to move rightward. Coalescing and fusion cut the bytes. Tiling multiplied the reuse. Watch it happen: naive matmul starves down on the diagonal, and tiling lifts the very same operation onto the flat roof.

Drag the intensity slider, or jump to a named op. Left of the ridge is memory-bound and slow. The goal of nearly every kernel trick is to climb the slope and reach the flat.

arithmetic intensity 6.0 FLOP/byte

⚠ Memory-bound at 6.0 FLOP/byte, only 3% of peak. Bandwidth caps you. Reuse more data per byte loaded to climb toward the ridge.

Where this track goes next

The rest of the payoff is the library side: how production kernels put every idea from Loom to work. It is now its own page, real kernels, with a steppable FlashAttention that reuses the scenes you just played with.

The kernel-authoring ladder, from least effort to most control:

PyTorch / torch.compile. Write plain tensor code; the compiler fuses it and generates Triton.
Triton. A Python domain-specific language (DSL) where you write block-level programs, and the compiler handles the thread mapping, coalescing, and pipelining.
CUTLASS / CuTe. C++ templates for peak tensor-core GEMM, driven by the layout algebra from Phase 6.
CUDA / PTX (the low-level instruction layer). The baseline. You place threads and do the address math by hand, exactly the mental model Phases 0 to 2 built.

And the kernels themselves, as a family tree: attention (FlashAttention, then the SDPA dispatcher, FlexAttention, and the quantized SageAttention line) and mixture of experts (grouped GEMM, ScatterMoE and SonicMoE, then expert parallelism and DeepEP across many GPUs). You do not need to know these names yet. Every one is just coalescing, tiling, fusion, and tensor cores, aimed at a real model.

That is Loom. A single piece of data, followed from slow memory to the tensor core and back, with every trick that keeps the machine busy along the way. The one idea underneath all of it has not changed since Phase 0: move the bytes you need, and no more.

How we got here

Loom taught the machine as it stands today, but it did not arrive all at once. Each generation added one big idea to the same story, and every phase you played with was somebody's headline feature a few years ago.

The clearest thread is the tensor core, the unit that multiplies small matrices in one instruction. It did not exist before Volta in 2017. Since then its menu of number formats has only grown: FP16, then INT8, then BF16, FP8, and now FP4. That growing menu is the whole reason the next page exists.

Click along the timeline. Watch the precision chips accumulate as you move right, and follow each generation back to the phase where you met its trick.

Blackwell B200 2024

Four-bit numbers and a new home for the accumulator.

FP4 and FP6 tensor cores with hardware microscaling (NVFP4), so accuracy survives four bits.
Tensor Memory (TMEM): the accumulator moves out of the register file, which could no longer keep up.
tcgen05 lets a single thread issue a matmul for the whole block, and a pair of SMs can share one.

You met this in Phase 5 · Blackwell →

Blackwell (2024, B200): Four-bit numbers and a new home for the accumulator. Tensor-core precisions available by this point: FP16, INT8, BF16 / TF32, FP8, FP4 / FP6.

Show as data

NVIDIA datacenter GPU generations and the tensor-core precision each added
Generation	Year	Chip	New precision
Pascal	2016	P100	—
Volta	2017	V100	FP16
Turing	2018	T4	INT8
Ampere	2020	A100	BF16 / TF32
Hopper	2022	H100	FP8
Blackwell	2024	B200	FP4 / FP6

Two of those moves opened up whole topics of their own. The narrowing numbers became a zoo of formats: FP8, FP4, NVFP4, MXFP4, and the GGUF quants that run models on your laptop. And the matrix unit became the engine behind real kernels like FlashAttention. Each has its own page.

Thread: The smallest unit of work on a GPU. One thread runs one instance of the kernel on one lane.
Warp: A group of 32 threads that execute the same instruction together, in lockstep. The scheduling unit of the GPU.
SIMT: Single Instruction, Multiple Threads. All 32 threads of a warp run one shared instruction over their own data.
Lockstep: All threads in a warp advance together on the same instruction at the same time.
Coalescing: When a warp reads neighbouring addresses so the hardware serves them in as few memory transactions as possible.
Sector: The 32-byte unit the hardware fetches from global memory. A warp wants its data packed into as few sectors as possible.
HBM: High Bandwidth Memory, the large off-chip global memory. Biggest and slowest tier, hundreds of cycles away.
Shared memory: Fast on-chip scratchpad private to one thread block. Split into 32 banks.
Register: Per-thread on-chip storage. The fastest memory, about one cycle to access.
L2 cache: On-chip cache shared by all SMs, sitting between the per-SM L1 caches and global HBM.
SM: Streaming Multiprocessor. A core building block of the GPU that runs thread blocks. An A100 has 108 of them.
Sub-partition: One of the four processing blocks inside an SM. Each has its own warp scheduler and execution units.
Warp scheduler: The unit that picks one eligible warp each cycle and issues its next instruction. Four per SM.
Eligible: A warp that is ready to issue this cycle, not waiting on memory or a dependency.
Stalled: A warp that cannot issue yet because it is waiting, usually on a memory load.
Latency hiding: Keeping the machine busy during long waits by running other ready warps while one warp stalls.
Occupancy: The number of active warps on an SM divided by the maximum it can hold. More warps give the scheduler more to switch to.
Bank: One of the 32 slots shared memory is split into. Consecutive 4-byte words map to consecutive banks (word w lands in bank w mod 32), and each bank serves one word per cycle.
Bank conflict: When two or more threads in a warp want different words in the same bank. Their reads serialize.
Tiling: Loading a small block of a matrix into shared memory once so every thread in the block reuses it many times.
Data reuse: Using a value staged in fast memory many times before fetching new data, so slow global memory is touched as little as possible.
GEMM: General matrix multiply, C = A times B. The workhorse operation behind neural networks and the main thing GPUs are tuned for.
cp.async: An asynchronous copy from global memory straight into shared memory, without stalling the thread or passing through registers.
Software pipelining: Overlapping the load of the next tile with the compute on the current one, so memory latency hides behind useful work.
Double buffering: Using two shared-memory buffers that take turns: one is being computed on while the other is being filled by the next load.
Prefetch: Starting a load early, before the data is needed, so it has arrived by the time you use it.
Thread block: A group of threads that run on one SM and share its shared memory. Also called a CTA.
Register file: The pool of registers on an SM, shared out among all resident threads. About 256 KB on an A100.
Register pressure: How many registers a kernel needs per thread. High pressure means fewer warps fit, which lowers occupancy.
Register spilling: When a thread needs more registers than it has, the extra values spill to local memory, which actually lives in slow global memory.
Local memory: Per-thread memory that, despite the name, lives in slow off-chip global memory. Registers spill here when they run out.
TMA: Tensor Memory Accelerator. A Hopper copy engine that moves whole tensor tiles between global and shared memory from a single descriptor, so one thread issues the load.
Tensor map descriptor: A small host-built struct (128 bytes) that tells TMA the tensor base, shape, strides, tile size, element type, and swizzle. One thread passes it to issue a bulk copy.
mbarrier: An asynchronous barrier in shared memory. TMA signals it when a tile lands and waiting threads wake, handing each buffer stage from producer to consumer.
wgmma: Warpgroup matrix multiply. A Hopper instruction where 128 threads issue one asynchronous tensor-core matmul that reads its operands from shared memory.
Warpgroup: Four contiguous warps, 128 threads, the granularity Hopper wgmma operates on.
Warp specialization: Giving different warps different jobs: producer warps issue TMA loads while consumer warps run wgmma, overlapping load and compute.
Thread block cluster: A Hopper group of blocks co-scheduled on one GPC that can read each other’s shared memory (distributed shared memory).
Multicast: A TMA mode that broadcasts one global load into several blocks’ shared memory in a cluster, so a shared operand crosses the bus only once.
Tensor Memory: A dedicated on-SM memory on Blackwell (256 KB) that holds the MMA accumulator, so the register file no longer has to feed the tensor cores at FP4 rates.
tcgen05: Blackwell’s 5th-generation tensor-core MMA. A single thread issues the matmul for the whole block, reading operands from shared memory and TMEM.
Microscaling: Storing a block of low-precision values (say FP4) with one shared low-precision block scale (E8M0 for MXFP4, FP8 for NVFP4), so tiny formats stay accurate. NVFP4 uses a block of 16.
FP4: A 4-bit floating-point format (E2M1). Packs eight values per 32 bits and roughly quadruples tensor-core peak over FP16, at the cost of range.
Accumulator: The running sum D in D = A × B that a matmul builds up across its K steps. Where it lives (registers or TMEM) is a recurring bottleneck.
CuTe: The layout layer under CUTLASS. Expresses thread-to-data mappings as Shape ⊗ Stride and composes, tiles, and swizzles them at compile time.
CUTLASS: NVIDIA’s open template library for peak-performance GEMM and related kernels, built on CuTe layouts.
Layout: A CuTe object, Shape ⊗ Stride, that maps a logical coordinate to a linear memory offset. Change the stride to re-lay-out data without moving it.
Stride: How far apart, in memory, consecutive elements along an axis sit. A stride of 1 means contiguous, which is what makes a read coalesce.
Swizzle: A layout that permutes shared-memory addresses so a tile reads back bank-conflict-free and in the order the tensor cores want.
Roofline: A plot of attainable throughput against arithmetic intensity. An op is memory-bound under a bandwidth roof until enough reuse lifts it to the flat compute roof.
Arithmetic intensity: FLOPs performed per byte moved from memory. Low intensity is memory-bound; tiling raises it until the op becomes compute-bound.
Exponent: A float's exponent bits set how big or small it can get. More exponent means more dynamic range.
Mantissa: A float's mantissa bits set how finely it resolves values between powers of two. More mantissa means more precision.
BF16: Brain float 16: 8 exponent bits and 7 mantissa bits. Same range as FP32 with less precision, which is why it is the training default.
FP8: An 8-bit float in two flavors: E4M3 (more precision, forward pass) and E5M2 (more range, gradients). Introduced for tensor cores on Hopper.
Quantization: Storing weights or activations in fewer bits than they were trained in, usually with a shared scale factor to recover the real magnitudes.
INT8: An 8-bit integer format with evenly spaced steps and a shared scale. Cheap and tight when a tensor has no wild outliers.
Ternary (BitNet b1.58): Weights restricted to three values, -1, 0, and +1, about 1.58 bits each. The matmul becomes addition and subtraction with no multiplies; trained from scratch, not compressed after.
NVFP4: NVIDIA's 4-bit float (E2M1) with one FP8 scale per 16 values plus a per-tensor FP32 scale. Finer blocks than MXFP4 for better accuracy.
MXFP4: The open OCP microscaling 4-bit float: one power-of-two scale (E8M0) shared across every block of 32 values.
GGUF: The llama.cpp file format that packs a model plus metadata for local inference. Holds k-quant tensor types like Q4_K_M.
k-quant: A GGUF quantization scheme: weights in super-blocks of 256, split into sub-blocks of 32, with a two-level (super-block and sub-block) scale.
Attention: The operation that lets each token weigh every other token: score queries against keys, softmax the scores, then blend the values.
Softmax: Turns a row of numbers into positive weights that sum to 1, so they can act as attention weights or class probabilities.
Token: One chunk of a model’s input or output, roughly a word or word-piece. Sequences are measured in tokens.
Gradient: The correction signal used to update a model during training. Gradients span a huge range of magnitudes, which is why their number format needs range.
Tensor core: A dedicated unit inside the SM that multiplies a small matrix in one instruction, far faster than ordinary threads doing it multiply by multiply. First shipped on Volta.
MMA: Matrix multiply-accumulate: the tensor core’s core operation, D = A × B + C. wgmma (Hopper) and tcgen05.mma (Blackwell) are MMA instructions.
FLOP: One floating-point operation, a single multiply or add. Throughput is measured in FLOPs per second (FLOP/s).
Activation: The data flowing through a model (the inputs and intermediate results), as opposed to the fixed weights. Activations carry outliers, which makes them harder to quantize.
W4A16: A quantization recipe: 4-bit weights, 16-bit activations. Weight-only, so it saves memory but computes in 16-bit. GPTQ, AWQ, and GGUF are W4A16.
W4A4: A quantization recipe: 4-bit weights and 4-bit activations. The matmul runs on low-precision tensor cores, saving compute too, but activation outliers make it hard. NVFP4 is W4A4.
SDPA: PyTorch's scaled_dot_product_attention: a dispatcher that auto-picks a fused FlashAttention-style backend (flash, memory-efficient, or cuDNN) for your shapes and dtype.
FlexAttention: A PyTorch API for writing custom attention masks and score modifications as a small function that still compiles to one fused FlashAttention-style kernel.
SageAttention: Quantized attention: runs the attention matmuls in INT8 or FP4 on tensor cores, smoothing outlier channels first. The W4A4 idea applied to attention.
SonicMoE: A mixture-of-experts kernel library that packs routed tokens into contiguous groups so each expert’s grouped-GEMM tile is full and efficient.
Data parallel: Replicate the whole model on every GPU, split the batch, and average gradients with an all-reduce. DDP is the efficient PyTorch version.
FSDP / ZeRO: Sharded data parallel: split the batch AND shard params, gradients, and optimizer state across GPUs, gathering each layer just in time. Trades communication for memory.
Tensor parallel: Split each layer's weight matrices across GPUs (Megatron splits heads and MLP columns) and all-reduce the activations. Chatty, so it wants NVLink.
Pipeline parallel: Put different layers on different GPUs as stages; activations flow stage to stage. The idle time while the pipeline fills and drains is the "bubble."
Expert parallel: Spread a mixture-of-experts layer’s experts across GPUs and all-to-all the tokens to wherever their expert lives. DeepEP accelerates the shuffle.
Context / sequence parallel: Split the sequence across GPUs. Sequence parallel saves activation memory on the non-matmul parts; context parallel (ring attention) splits attention over ultra-long contexts.
All-reduce: A collective that sums a value across all GPUs and hands every GPU the total. Used to average gradients (data parallel) and combine activations (tensor parallel).
All-to-all: A collective where every GPU sends a different piece to every other GPU. Used to route tokens to their expert in expert parallelism.
KV cache: The stored Keys and Values of every past token, kept so they are not recomputed each step. It dominates long-context memory and grows with the number of KV heads, head size, layers, and sequence length.
RoPE: Rotary position embedding: encodes position by rotating the query and key vectors by an angle proportional to position, so the attention score depends on relative distance. Applied every layer, not added to embeddings.
NoPE: No positional encoding: a decoder-only causal model can infer position from the causal mask alone (a counting signal), with no explicit position input. Works only in the causal setting and has a finite usable range.
Sliding window attention: Each token attends only to the last W tokens (a local band), capping cost and local KV cache. Stacking layers still compounds the reach, so the model is not limited to W.
Attention sink: A learned bias that keeps the always-important first tokens in the softmax denominator, letting a very small sliding window (like gpt-oss’s 128) stay stable.
MHA: Multi-head attention: every query head has its own Key and Value head. Best quality, largest KV cache.
MQA: Multi-query attention: all query heads share a single Key/Value head. Smallest KV cache, but the hard sharing can cost quality.
GQA: Grouped-query attention: query heads are split into groups, each sharing one Key/Value head. Near-MHA quality at close to MQA memory, and the mainstream default.
MLA: Multi-head latent attention (DeepSeek): compress every head's K and V into one small shared latent, cache only that, and reconstruct per-head K/V on the fly. GQA-level cache at MHA quality, with a small decoupled RoPE key.
DSA: DeepSeek Sparse Attention (V3.2): a lightning indexer scores past tokens and each query attends only to its top-k (2048), turning attention from order N squared into order N times k. Built on MLA.
YaRN: A RoPE context-extension method: scale the rotation frequencies (NTK-style) and add an attention-temperature correction, so a model trained at one length works at a longer one.

Thread	Word	Bank
0	0	0
1	2	2
2	4	4
3	6	6
4	8	8
5	10	10
6	12	12
7	14	14
8	16	16
9	18	18
10	20	20
11	22	22
12	24	24
13	26	26
14	28	28
15	30	30
16	32	0
17	34	2
18	36	4
19	38	6
20	40	8
21	42	10
22	44	12
23	46	14
24	48	16
25	50	18
26	52	20
27	54	22
28	56	24
29	58	26
30	60	28
31	62	30

Thread	Word	Bank
0	0	0
1	32	0
2	64	0
3	96	0
4	128	0
5	160	0
6	192	0
7	224	0
8	256	0
9	288	0
10	320	0
11	352	0
12	384	0
13	416	0
14	448	0
15	480	0
16	512	0
17	544	0
18	576	0
19	608	0
20	640	0
21	672	0
22	704	0
23	736	0
24	768	0
25	800	0
26	832	0
27	864	0
28	896	0
29	928	0
30	960	0
31	992	0

Thread	Word	Bank
0	0	0
1	2	2
2	4	4
3	6	6
4	8	8
5	10	10
6	12	12
7	14	14
8	16	16
9	18	18
10	20	20
11	22	22
12	24	24
13	26	26
14	28	28
15	30	30
16	32	0
17	34	2
18	36	4
19	38	6
20	40	8
21	42	10
22	44	12
23	46	14
24	48	16
25	50	18
26	52	20
27	54	22
28	56	24
29	58	26
30	60	28
31	62	30

Thread	Word	Bank
0	0	0
1	32	0
2	64	0
3	96	0
4	128	0
5	160	0
6	192	0
7	224	0
8	256	0
9	288	0
10	320	0
11	352	0
12	384	0
13	416	0
14	448	0
15	480	0
16	512	0
17	544	0
18	576	0
19	608	0
20	640	0
21	672	0
22	704	0
23	736	0
24	768	0
25	800	0
26	832	0
27	864	0
28	896	0
29	928	0
30	960	0
31	992	0

Thread	Word	Bank
0	0	0
1	2	2
2	4	4
3	6	6
4	8	8
5	10	10
6	12	12
7	14	14
8	16	16
9	18	18
10	20	20
11	22	22
12	24	24
13	26	26
14	28	28
15	30	30
16	32	0
17	34	2
18	36	4
19	38	6
20	40	8
21	42	10
22	44	12
23	46	14
24	48	16
25	50	18
26	52	20
27	54	22
28	56	24
29	58	26
30	60	28
31	62	30

Thread	Word	Bank
0	0	0
1	32	0
2	64	0
3	96	0
4	128	0
5	160	0
6	192	0
7	224	0
8	256	0
9	288	0
10	320	0
11	352	0
12	384	0
13	416	0
14	448	0
15	480	0
16	512	0
17	544	0
18	576	0
19	608	0
20	640	0
21	672	0
22	704	0
23	736	0
24	768	0
25	800	0
26	832	0
27	864	0
28	896	0
29	928	0
30	960	0
31	992	0