What is Quantization?

Infrastructure 5 min read

Definition

Quantization is a technique that reduces AI model size and computational requirements by using lower precision numbers (e.g., 8-bit integers instead of 32-bit floats) to represent model weights and activations. This enables running large models on smaller hardware like consumer GPUs, CPUs, or even mobile devices.

How It Works

  • FP32 (32-bit float): Full precision - 4 bytes per weight
  • FP16 (16-bit float): Half precision - 2 bytes per weight
  • INT8 (8-bit integer): 1 byte per weight, significant compression
  • INT4 (4-bit integer): 0.5 bytes per weight, extreme compression

Benefits

  • Smaller Size: 4x reduction for INT8, 8x for INT4
  • Faster Inference: Faster computation on compatible hardware
  • Lower Memory: Run large models on consumer GPUs
  • Cost Savings: Reduced compute costs for API services