What is Quantization?
Definition
Quantization is a technique that reduces AI model size and computational requirements by using lower precision numbers (e.g., 8-bit integers instead of 32-bit floats) to represent model weights and activations. This enables running large models on smaller hardware like consumer GPUs, CPUs, or even mobile devices.
How It Works
- FP32 (32-bit float): Full precision - 4 bytes per weight
- FP16 (16-bit float): Half precision - 2 bytes per weight
- INT8 (8-bit integer): 1 byte per weight, significant compression
- INT4 (4-bit integer): 0.5 bytes per weight, extreme compression
Benefits
- Smaller Size: 4x reduction for INT8, 8x for INT4
- Faster Inference: Faster computation on compatible hardware
- Lower Memory: Run large models on consumer GPUs
- Cost Savings: Reduced compute costs for API services