Quantization converts model weights from a high-precision format (typically 16- or 32-bit float) to a lower-precision one (8-bit, 4-bit, or even 2-bit). This shrinks model size, reduces memory requirements, and speeds up inference, often with minimal quality loss, which makes it possible to run large models on consumer GPUs. Common options include the GGUF file format and the GPTQ and AWQ quantization methods.
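To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. The function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library: each float weight is divided by a single scale factor and rounded to an 8-bit integer, and dequantization multiplies back by the scale.

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: one scale maps the whole
    # tensor's range [-max|w|, +max|w|] onto the int8 range [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights from int8 values.
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)   # toy "weight" tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype)                                  # int8 storage: 4x smaller than float32
print(np.max(np.abs(w - w_hat)) <= scale / 2)   # round-to-nearest error bound
```

Because rounding is to the nearest integer, the per-weight reconstruction error is bounded by half the scale, which is why quantization degrades quality so little when the weight distribution is well-behaved. Real schemes refine this with per-channel or per-group scales and calibration data.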
