What is quantisation?
Quantisation reduces the numerical precision of a model’s weights, for example from 16- or 32-bit floating point down to 8-bit, 4-bit, or even 2-bit integers. The model gets smaller, runs faster, and uses less memory. The trade-off is a small reduction in output quality.
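A minimal sketch of the idea, using symmetric int8 quantisation (one of several schemes; real quantisers such as those in llama.cpp work per-block and are more sophisticated):

```python
def quantise_int8(weights):
    # Symmetric per-tensor quantisation: map each float to an integer in [-127, 127].
    # One float32 scale factor is stored alongside the int8 values.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    # Recover approximate floats by multiplying back by the scale.
    return [v * scale for v in q]

weights = [0.12, -0.80, 0.33, 0.05]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Storage drops 4x versus float32; each restored value differs from the
# original by at most scale / 2 -- that rounding error is the quality cost.
```

Lower bit widths shrink storage further but widen the rounding error, which is why 2-bit models lose more quality than 4-bit ones.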
Here is the scale of the savings. Llama 3.1 70B at 16-bit precision: 140 GB. At 4-bit quantisation: 35 GB. At 2-bit: 18 GB. The 4-bit version runs on a single GPU that costs about $1/hour; the 16-bit version needs four such GPUs at $4/hour. Output quality difference on MMLU: approximately 1–2 percentage points.
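The arithmetic behind those sizes is simple: weight storage is parameter count times bits per weight, divided by 8 to get bytes. A quick check (weights only; activations, KV cache, and runtime overhead add more on top):

```python
def model_size_gb(params_billion: float, bits: int) -> float:
    # Weight storage only: params * bits / 8 bytes, expressed in GB (10^9 bytes).
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 4, 2):
    print(f"{bits}-bit: {model_size_gb(70, bits):.1f} GB")
# 16-bit: 140.0 GB, 4-bit: 35.0 GB, 2-bit: 17.5 GB
```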
Why it matters
Quantisation is how open-weights models become practical for self-hosting. Without it, running a 70B model requires enterprise hardware. With it, the same model runs on consumer GPUs or cloud instances at a fraction of the cost. Ollama and LM Studio both use quantised models by default. sourc.dev tracks open-weights support; quantisation is why that attribute matters for cost.