The number zoo

BF16, FP8, FP4, NVFP4, MXFP4, INT8, Q4_K, GGUF. A pile of names for one small question: how do you spend a handful of bits per number? Two knobs explain all of it.

Why there are so many formats

On the index, Phase 5 shrank the numbers from FP16 to FP4 and the tensor cores sped up. That opened a question the hardware tour skipped: what is an FP4 number, and why are there a dozen four-bit formats with different names?

The whole zoo comes from turning just two knobs:

  1. The bit split. A float spends its bits on range (the exponent) or precision (the mantissa). Move the split and you get a different format at the same width.
  2. The scale sharing. Once numbers get tiny, they lean on a shared scale factor. How many values share one scale is the second knob, and it is the entire difference between MXFP4 and NVFP4.

Turn the first knob now. Pick a format and watch its bits rearrange.

BF16 is 16 bits: 1 sign, 8 exponent, 7 mantissa. It reaches about ±3.4 × 10^38 with roughly 2.1 decimal digits of precision. Use: the training default: FP32 range, so no loss scaling.

Show as data
Floating-point formats: bit split, range, and precision
FormatBitsS/E/MMax ±~digits
FP32321/8/233.4 × 10^386.9
TF32191/8/103.4 × 10^383.0
FP16161/5/1065,5043.0
BF16161/8/73.4 × 10^382.1
FP8 E4M381/4/34480.9
FP8 E5M281/5/257,3440.6
FP6 E3M261/3/2280.6
FP4 E2M141/2/160.3

Anatomy of a float: range versus precision

Every floating-point number is one sign bit, some exponent bits that set how big or small it can get, and some mantissa bits that set how finely it resolves values in between. More exponent means more range. More mantissa means more precision. The width is fixed, so the two trade against each other. A couple of names on the cheat sheet are just points on this dial: TF32 keeps FP32's 8 exponent bits but trims the mantissa to 10 (19 real bits the tensor cores treat as a cheap FP32 stand-in), and FP6 is a middle rung between FP8 and FP4.

The cleanest example is FP16 versus BF16. Both are 16 bits. FP16 keeps 10 mantissa bits (fine precision) but only 5 exponent bits (narrow range). BF16 does the opposite: 8 exponent bits, the same range as FP32, with just 7 mantissa bits. That is why training loves BF16. The gradients (the correction signals used to update the model during training, which span a huge range of magnitudes) never overflow, so you skip loss-scaling (manually multiplying them up so they do not vanish in FP16's narrow range) that FP16 needs.

When integers win

Not every format is a float. INT8 and INT4 drop the exponent entirely: the values are just evenly spaced steps, and one shared scale stretches them over the real range. Even spacing is a good fit when a tensor's values are already well-behaved, and integer math is cheap.

This is the format behind a lot of quantized inference, and behind SageAttention, which runs the attention matmul on INT8 tensor cores after smoothing away the outliers. Floats handle wild ranges better; integers are simpler and pack tightly when the range is tame.

Click any square to drop in an outlier and see how far its damage spreads.

Scale overhead 0.25 bits / value
4-bit payload really costs 4.25 bits / value
Outlier blast radius 32 values

OCP microscaling: a shared power-of-two scale per 32 values.

MX (block 32): one E8M0 scale per 32 values. That adds 0.25 overhead bits per value, so a 4-bit payload costs 4.25 bits. An outlier distorts 32 values.

Show as data
Scaling schemes: block size, scale format, and overhead on a 4-bit payload
SchemeValues / scaleScaleOverhead b/val4-bit really costs
Per-tensorwhole tensorFP320.004.00
Per-channel256FP320.134.13
MX (block 32)32E8M00.254.25
NVFP4 (block 16)16FP8 E4M30.504.50

The scaling ladder: who shares a scale

Four bits can barely tell numbers apart, so a low-precision value is never stored alone. It comes with a scale factor, a full-precision multiplier that stretches the tiny payload back to real magnitudes. The one design choice left is how many values share one scale.

Share one scale across a whole tensor and it is almost free, but a single giant outlier drags the scale up and flattens everything else to zero. Share one scale per small block and an outlier only spoils its own block. The cost is overhead: the scales themselves take bits. That trade is the ladder above.

Quick checkPer-tensor scaling adds almost no overhead bits. Why not always use it for FP4?

MXFP4 versus NVFP4: same idea, finer grain

Both are four-bit floats (E2M1) with a shared block scale. The difference is only in the second knob:

  • MXFP4 is the open OCP (Open Compute Project) microscaling standard. One scale per 32 values, and that scale is a plain power of two (E8M0, just an exponent). Cheap: 0.25 overhead bits per value.
  • NVFP4 is NVIDIA's variant. One scale per 16 values, and the scale is a real FP8 number (not just a power of two), plus one FP32 scale for the whole tensor. Finer blocks and a richer scale mean better accuracy at 4-bit, for 0.5 overhead bits per value.

So NVFP4 spends a bit more to resolve the numbers more faithfully. Same family, tighter grain. The same split (block 32, power-of-two scale, versus block 16, FP8 scale) also distinguishes MXFP8 from the older per-tensor FP8.

W4A16 vs W4A4: what you quantize, not just how much

So far the knobs were about a single number. But a matmul has two operands: the weights (the fixed numbers baked into the model) and the activations (the data flowing through it). You can quantize them independently, and the shorthand for that is WnAm: n-bit weights, m-bit activations. W4A16 is 4-bit weights with 16-bit activations; W4A4 is four bits for both.

The split matters because the two operands win you different things:

  • Weight-only (W4A16, also W4A8). Quantize just the weights; the matmul unpacks them back to 16-bit and computes there. The win is memory: when you serve one user a token at a time, decode is memory-bound and the weights dominate the bytes moved, so 4-bit weights cut that traffic about 4x. It does not speed up the math. GPTQ, AWQ, and the GGUF k-quants are all W4A16.
  • Weight and activation (W4A4, W8A8). Quantize both, so the matmul itself runs on low-precision tensor cores. Now you win compute too, not just memory. The catch is that activations carry outliers, so naive 4-bit activations wreck accuracy. The fixes are the ones from this page: microscaling (NVFP4 is W4A4 done well) or smoothing and rotation tricks like SmoothQuant and QuaRot. W8A8 (INT8 both) is the mature version; W4A4 is the frontier.
What each recipe quantizes and what it buys
RecipeWeightsActivationsSpeeds upExamples
W16A1616-bit16-bitbaselineBF16 training/serving
W8A88-bit8-bitmemory + computeINT8 / FP8 serving
W4A164-bit16-bitmemory onlyGPTQ, AWQ, GGUF
W4A44-bit4-bitmemory + computeNVFP4, INT4 (frontier)
FP16
16.0

bits per weight (bar) vs FP16 →

Q4_K_M stores each weight in about 4.53 bits, 3.5 times smaller than FP16. the popular default: near-lossless, half the size.

Show as data
GGUF k-quant types, effective bits per weight, and shrink vs FP16
TypeBits/weightShrink vs FP16
Q2_K2.636.1×
Q3_K_M3.914.1×
Q4_K_M4.533.5×
Q5_K_M5.502.9×
Q6_K6.562.4×
Q8_08.501.9×

On your laptop: GGUF and the k-quants

The formats above chase throughput on datacenter tensor cores. The Q4_K, Q5_K, Q6_K names you meet downloading a model are a different world: llama.cpp running one user's chat on a laptop CPU or a Mac, where the job is memory-bound and there are no tensor cores to feed.

GGUF is the file format, the container that holds the weights plus metadata. The Qn_K types inside it are k-quants: weights packed into super-blocks of 256, split into sub-blocks of 16 or 32 (depending on the type), with a two-level scale (a scale and a min, so a block need not center on zero, per sub-block, themselves scaled by the super-block). That two-level structure is the k-quant ("K") scheme, and it is why a 4-bit k-quant stays remarkably close to the original.

Same core trick as NVFP4: a small payload plus a shared block scale. Different design point, because the bottleneck and the hardware are different. That is the whole zoo, twice.

The floor: the 1.58-bit LLM

Every format so far kept a number as a number, just with fewer bits. Ternary throws that out. In BitNet b1.58 (Microsoft, 2024) every weight is one of only three values: -1, 0, or +1. Three possibilities need log2(3) ≈ 1.58 bits, which is where the name comes from. That is below FP4, below INT4, about as low as a number can go and still exist.

Why 1.58 and not a clean 1 bit? One bit gives you two values, -1 and +1 (the original BitNet). Adding the zero, letting a weight switch off, is what lifted quality to match a full-precision model of the same size. The knob here is not range versus precision; it is how many distinct values a weight may take at all.

The values come from absmean quantization: scale the weight matrix by its average absolute value, then round each weight to the nearest of -1, 0, +1. Activations stay at 8-bit, so in the WnAm shorthand BitNet is W1.58A8.

Here is the enigmatic part. When a weight can only be -1, 0, or +1, the matmul stops multiplying. Multiplying an activation by +1 keeps it, by -1 negates it, by 0 drops it. The single operation every GPU is built to accelerate collapses into plain addition and subtraction:

That is why the paper's title calls it "the era of 1-bit LLMs" and argues for hardware built for addition, not multiplication. One important catch: you cannot reach ternary by squashing a finished model the way GPTQ or a k-quant does. BitNet is trained from scratch with the rounding in the loop (it keeps full-precision shadow weights during training and only rounds on the forward pass), so ternary is a training decision, not a post-hoc compression. It is still a research frontier: Microsoft has released a 2-billion-parameter model (BitNet b1.58 2B4T) and a CPU runtime (bitnet.cpp), but whether it holds up at frontier scale is open.

Format cheat sheet

One table for the road. Where each format lives, and the hardware that first ran it.

The number formats in this page at a glance
FormatBitsKindWhere it lives
FP3232floatmaster weights, accumulators
TF3219floattensor-core FP32 stand-in (Ampere+)
BF1616floatthe training default
FP1616floatinference, mixed-precision training
FP8 (E4M3 / E5M2)8floatHopper+ training and serving
MXFP88 + scalefloat + blockblock-scaled FP8 (block 32)
FP66floatBlackwell middle ground
NVFP44 + scalefloat + blockBlackwell serving (block 16, FP8 scale)
MXFP44 + scalefloat + blockopen microscaling (block 32)
INT8 / INT48 / 4integerquantized inference, SageAttention
Q4_K_M~4.5k-quantllama.cpp / GGUF, laptop inference
Q6_K~6.6k-quantnear-lossless local inference
BitNet b1.58~1.58ternarytrained from scratch, multiply-free (research)
Thread
The smallest unit of work on a GPU. One thread runs one instance of the kernel on one lane.
Warp
A group of 32 threads that execute the same instruction together, in lockstep. The scheduling unit of the GPU.
SIMT
Single Instruction, Multiple Threads. All 32 threads of a warp run one shared instruction over their own data.
Lockstep
All threads in a warp advance together on the same instruction at the same time.
Coalescing
When a warp reads neighbouring addresses so the hardware serves them in as few memory transactions as possible.
Sector
The 32-byte unit the hardware fetches from global memory. A warp wants its data packed into as few sectors as possible.
HBM
High Bandwidth Memory, the large off-chip global memory. Biggest and slowest tier, hundreds of cycles away.
Shared memory
Fast on-chip scratchpad private to one thread block. Split into 32 banks.
Register
Per-thread on-chip storage. The fastest memory, about one cycle to access.
L2 cache
On-chip cache shared by all SMs, sitting between the per-SM L1 caches and global HBM.
SM
Streaming Multiprocessor. A core building block of the GPU that runs thread blocks. An A100 has 108 of them.
Sub-partition
One of the four processing blocks inside an SM. Each has its own warp scheduler and execution units.
Warp scheduler
The unit that picks one eligible warp each cycle and issues its next instruction. Four per SM.
Eligible
A warp that is ready to issue this cycle, not waiting on memory or a dependency.
Stalled
A warp that cannot issue yet because it is waiting, usually on a memory load.
Latency hiding
Keeping the machine busy during long waits by running other ready warps while one warp stalls.
Occupancy
The number of active warps on an SM divided by the maximum it can hold. More warps give the scheduler more to switch to.
Bank
One of the 32 slots shared memory is split into. Consecutive 4-byte words map to consecutive banks (word w lands in bank w mod 32), and each bank serves one word per cycle.
Bank conflict
When two or more threads in a warp want different words in the same bank. Their reads serialize.
Tiling
Loading a small block of a matrix into shared memory once so every thread in the block reuses it many times.
Data reuse
Using a value staged in fast memory many times before fetching new data, so slow global memory is touched as little as possible.
GEMM
General matrix multiply, C = A times B. The workhorse operation behind neural networks and the main thing GPUs are tuned for.
cp.async
An asynchronous copy from global memory straight into shared memory, without stalling the thread or passing through registers.
Software pipelining
Overlapping the load of the next tile with the compute on the current one, so memory latency hides behind useful work.
Double buffering
Using two shared-memory buffers that take turns: one is being computed on while the other is being filled by the next load.
Prefetch
Starting a load early, before the data is needed, so it has arrived by the time you use it.
Thread block
A group of threads that run on one SM and share its shared memory. Also called a CTA.
Register file
The pool of registers on an SM, shared out among all resident threads. About 256 KB on an A100.
Register pressure
How many registers a kernel needs per thread. High pressure means fewer warps fit, which lowers occupancy.
Register spilling
When a thread needs more registers than it has, the extra values spill to local memory, which actually lives in slow global memory.
Local memory
Per-thread memory that, despite the name, lives in slow off-chip global memory. Registers spill here when they run out.
TMA
Tensor Memory Accelerator. A Hopper copy engine that moves whole tensor tiles between global and shared memory from a single descriptor, so one thread issues the load.
Tensor map descriptor
A small host-built struct (128 bytes) that tells TMA the tensor base, shape, strides, tile size, element type, and swizzle. One thread passes it to issue a bulk copy.
mbarrier
An asynchronous barrier in shared memory. TMA signals it when a tile lands and waiting threads wake, handing each buffer stage from producer to consumer.
wgmma
Warpgroup matrix multiply. A Hopper instruction where 128 threads issue one asynchronous tensor-core matmul that reads its operands from shared memory.
Warpgroup
Four contiguous warps, 128 threads, the granularity Hopper wgmma operates on.
Warp specialization
Giving different warps different jobs: producer warps issue TMA loads while consumer warps run wgmma, overlapping load and compute.
Thread block cluster
A Hopper group of blocks co-scheduled on one GPC that can read each other’s shared memory (distributed shared memory).
Multicast
A TMA mode that broadcasts one global load into several blocks’ shared memory in a cluster, so a shared operand crosses the bus only once.
Tensor Memory
A dedicated on-SM memory on Blackwell (256 KB) that holds the MMA accumulator, so the register file no longer has to feed the tensor cores at FP4 rates.
tcgen05
Blackwell’s 5th-generation tensor-core MMA. A single thread issues the matmul for the whole block, reading operands from shared memory and TMEM.
Microscaling
Storing a block of low-precision values (say FP4) with one shared low-precision block scale (E8M0 for MXFP4, FP8 for NVFP4), so tiny formats stay accurate. NVFP4 uses a block of 16.
FP4
A 4-bit floating-point format (E2M1). Packs eight values per 32 bits and roughly quadruples tensor-core peak over FP16, at the cost of range.
Accumulator
The running sum D in D = A × B that a matmul builds up across its K steps. Where it lives (registers or TMEM) is a recurring bottleneck.
CuTe
The layout layer under CUTLASS. Expresses thread-to-data mappings as Shape ⊗ Stride and composes, tiles, and swizzles them at compile time.
CUTLASS
NVIDIA’s open template library for peak-performance GEMM and related kernels, built on CuTe layouts.
Layout
A CuTe object, Shape ⊗ Stride, that maps a logical coordinate to a linear memory offset. Change the stride to re-lay-out data without moving it.
Stride
How far apart, in memory, consecutive elements along an axis sit. A stride of 1 means contiguous, which is what makes a read coalesce.
Swizzle
A layout that permutes shared-memory addresses so a tile reads back bank-conflict-free and in the order the tensor cores want.
Roofline
A plot of attainable throughput against arithmetic intensity. An op is memory-bound under a bandwidth roof until enough reuse lifts it to the flat compute roof.
Arithmetic intensity
FLOPs performed per byte moved from memory. Low intensity is memory-bound; tiling raises it until the op becomes compute-bound.
Exponent
A float's exponent bits set how big or small it can get. More exponent means more dynamic range.
Mantissa
A float's mantissa bits set how finely it resolves values between powers of two. More mantissa means more precision.
BF16
Brain float 16: 8 exponent bits and 7 mantissa bits. Same range as FP32 with less precision, which is why it is the training default.
FP8
An 8-bit float in two flavors: E4M3 (more precision, forward pass) and E5M2 (more range, gradients). Introduced for tensor cores on Hopper.
Quantization
Storing weights or activations in fewer bits than they were trained in, usually with a shared scale factor to recover the real magnitudes.
INT8
An 8-bit integer format with evenly spaced steps and a shared scale. Cheap and tight when a tensor has no wild outliers.
Ternary (BitNet b1.58)
Weights restricted to three values, -1, 0, and +1, about 1.58 bits each. The matmul becomes addition and subtraction with no multiplies; trained from scratch, not compressed after.
NVFP4
NVIDIA's 4-bit float (E2M1) with one FP8 scale per 16 values plus a per-tensor FP32 scale. Finer blocks than MXFP4 for better accuracy.
MXFP4
The open OCP microscaling 4-bit float: one power-of-two scale (E8M0) shared across every block of 32 values.
GGUF
The llama.cpp file format that packs a model plus metadata for local inference. Holds k-quant tensor types like Q4_K_M.
k-quant
A GGUF quantization scheme: weights in super-blocks of 256, split into sub-blocks of 32, with a two-level (super-block and sub-block) scale.
Attention
The operation that lets each token weigh every other token: score queries against keys, softmax the scores, then blend the values.
Softmax
Turns a row of numbers into positive weights that sum to 1, so they can act as attention weights or class probabilities.
Token
One chunk of a model’s input or output, roughly a word or word-piece. Sequences are measured in tokens.
Gradient
The correction signal used to update a model during training. Gradients span a huge range of magnitudes, which is why their number format needs range.
Tensor core
A dedicated unit inside the SM that multiplies a small matrix in one instruction, far faster than ordinary threads doing it multiply by multiply. First shipped on Volta.
MMA
Matrix multiply-accumulate: the tensor core’s core operation, D = A × B + C. wgmma (Hopper) and tcgen05.mma (Blackwell) are MMA instructions.
FLOP
One floating-point operation, a single multiply or add. Throughput is measured in FLOPs per second (FLOP/s).
Activation
The data flowing through a model (the inputs and intermediate results), as opposed to the fixed weights. Activations carry outliers, which makes them harder to quantize.
W4A16
A quantization recipe: 4-bit weights, 16-bit activations. Weight-only, so it saves memory but computes in 16-bit. GPTQ, AWQ, and GGUF are W4A16.
W4A4
A quantization recipe: 4-bit weights and 4-bit activations. The matmul runs on low-precision tensor cores, saving compute too, but activation outliers make it hard. NVFP4 is W4A4.
SDPA
PyTorch's scaled_dot_product_attention: a dispatcher that auto-picks a fused FlashAttention-style backend (flash, memory-efficient, or cuDNN) for your shapes and dtype.
FlexAttention
A PyTorch API for writing custom attention masks and score modifications as a small function that still compiles to one fused FlashAttention-style kernel.
SageAttention
Quantized attention: runs the attention matmuls in INT8 or FP4 on tensor cores, smoothing outlier channels first. The W4A4 idea applied to attention.
SonicMoE
A mixture-of-experts kernel library that packs routed tokens into contiguous groups so each expert’s grouped-GEMM tile is full and efficient.
Data parallel
Replicate the whole model on every GPU, split the batch, and average gradients with an all-reduce. DDP is the efficient PyTorch version.
FSDP / ZeRO
Sharded data parallel: split the batch AND shard params, gradients, and optimizer state across GPUs, gathering each layer just in time. Trades communication for memory.
Tensor parallel
Split each layer's weight matrices across GPUs (Megatron splits heads and MLP columns) and all-reduce the activations. Chatty, so it wants NVLink.
Pipeline parallel
Put different layers on different GPUs as stages; activations flow stage to stage. The idle time while the pipeline fills and drains is the "bubble."
Expert parallel
Spread a mixture-of-experts layer’s experts across GPUs and all-to-all the tokens to wherever their expert lives. DeepEP accelerates the shuffle.
Context / sequence parallel
Split the sequence across GPUs. Sequence parallel saves activation memory on the non-matmul parts; context parallel (ring attention) splits attention over ultra-long contexts.
All-reduce
A collective that sums a value across all GPUs and hands every GPU the total. Used to average gradients (data parallel) and combine activations (tensor parallel).
All-to-all
A collective where every GPU sends a different piece to every other GPU. Used to route tokens to their expert in expert parallelism.
KV cache
The stored Keys and Values of every past token, kept so they are not recomputed each step. It dominates long-context memory and grows with the number of KV heads, head size, layers, and sequence length.
RoPE
Rotary position embedding: encodes position by rotating the query and key vectors by an angle proportional to position, so the attention score depends on relative distance. Applied every layer, not added to embeddings.
NoPE
No positional encoding: a decoder-only causal model can infer position from the causal mask alone (a counting signal), with no explicit position input. Works only in the causal setting and has a finite usable range.
Sliding window attention
Each token attends only to the last W tokens (a local band), capping cost and local KV cache. Stacking layers still compounds the reach, so the model is not limited to W.
Attention sink
A learned bias that keeps the always-important first tokens in the softmax denominator, letting a very small sliding window (like gpt-oss’s 128) stay stable.
MHA
Multi-head attention: every query head has its own Key and Value head. Best quality, largest KV cache.
MQA
Multi-query attention: all query heads share a single Key/Value head. Smallest KV cache, but the hard sharing can cost quality.
GQA
Grouped-query attention: query heads are split into groups, each sharing one Key/Value head. Near-MHA quality at close to MQA memory, and the mainstream default.
MLA
Multi-head latent attention (DeepSeek): compress every head's K and V into one small shared latent, cache only that, and reconstruct per-head K/V on the fly. GQA-level cache at MHA quality, with a small decoupled RoPE key.
DSA
DeepSeek Sparse Attention (V3.2): a lightning indexer scores past tokens and each query attends only to its top-k (2048), turning attention from order N squared into order N times k. Built on MLA.
YaRN
A RoPE context-extension method: scale the rotation frequencies (NTK-style) and add an attention-temperature correction, so a model trained at one length works at a longer one.