
Blog Entry for Website (Point-Wise Structure)

Title: “LLM Quantisation Unpacked: AWQ v/s GGUF”

The Need for Quantisation:

  • Fit large language models (LLMs) onto smaller devices or GPUs.
  • Make LLMs more accessible to smaller companies and to individuals who want to test and experiment with them.

Quantisation and Size Reduction

  • A primary advantage of quantisation is a significant reduction in model size. Let’s consider as our example a 7B model (7 billion parameters).
  • The size of a model generally corresponds to the number of parameters it has. Each parameter in a floating-point model typically uses 32 bits or 4 bytes in memory. So the size of a model can be approximately calculated as the number of parameters multiplied by the size in memory of each parameter.
  • Now, let’s consider a model with 7 billion parameters:
  • The size of the parameters in a 32-bit floating-point model: 7B (parameters) * 4 (bytes/parameter) = 28GB
  • But if we quantise the parameters to 8 bits, each parameter now takes 1 byte in memory. So, the model size after 8-bit quantisation would be: 7B (parameters) * 1 (bytes/parameter) = 7GB
  • Similarly, with 4-bit quantisation each parameter now takes 0.5 bytes in memory. So, the model size after 4-bit quantisation would be: 7B (parameters) * 0.5 (bytes/parameter) = 3.5GB
  • As you can see, quantisation dramatically reduces the size of the model, making it far more manageable to run on smaller devices or GPUs with less memory (a quick back-of-the-envelope sketch follows this list). However, it’s crucial to note that quantisation may also have a slight impact on the model’s performance depending on the bit-width used, so the choice of quantisation strategy should account for this trade-off between size reduction and performance.
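The arithmetic above is easy to reproduce. Here is a minimal Python sketch of the same back-of-the-envelope estimate; it counts only the weights themselves and ignores activations, the KV cache, and any per-group quantisation metadata.

```python
# Rough weight-memory estimate for a 7B-parameter model at different bit-widths.

def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in gigabytes (using 1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

params = 7e9  # 7 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_size_gb(params, bits):.1f} GB")

# Prints:
# 32-bit: ~28.0 GB
# 16-bit: ~14.0 GB
#  8-bit: ~7.0 GB
#  4-bit: ~3.5 GB
```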

Understanding Quantisation:

  • It simplifies a model after training by reducing the numerical precision of its parameters.
  • Approximates 32-bit or 16-bit numbers with a 4-bit representation (a toy example follows this list).
  • Storage space is dramatically reduced as a result.
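To see what “approximating a 32-bit number with 4 bits” looks like, here is a toy round-to-nearest (absmax) example in NumPy. It is an illustration only; AWQ, GPTQ, and GGUF use more elaborate schemes (per-group scales, zero-points, activation-aware scaling), but the basic quantise/dequantise round trip is the same idea.

```python
import numpy as np

# Illustrative symmetric (absmax) 4-bit quantisation of float32 weights.
weights = np.array([0.82, -1.37, 0.05, 2.10, -0.44], dtype=np.float32)

# 4 bits -> signed integers in [-8, 7]; the scale maps the largest weight onto that range.
scale = np.abs(weights).max() / 7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)   # values that fit in 4 bits
dequantised = q.astype(np.float32) * scale                      # approximate reconstruction

print("quantised ints :", q)
print("reconstructed  :", dequantised)
print("max abs error  :", np.abs(weights - dequantised).max())
```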

Which Quantisation Method to Use:

  • For laptops like Macs - GGUF.
  • For GPUs - AWQ.

Types of Quantisation:

  • AWQ and GPTQ - rely on a calibration data set to identify the most relevant activations.
  • GGUF and Bits and Bytes - no data set is required.

Running Quantisation:

  • With Bits and Bytes, you can quantise ‘on the fly’ as the model is loaded (see the sketch after this list).
  • For AWQ and GPTQ, a calibration data set must be selected before quantisation.
  • GGUF, in principle, also allows on-the-fly quantisation.
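As an example of on-the-fly quantisation, here is a minimal sketch of loading a model in 4-bit with Bits and Bytes through the Transformers `BitsAndBytesConfig`. The model name is just a placeholder, and the NF4/bfloat16 settings are common defaults rather than the only option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# "On the fly": the checkpoint is stored in full or half precision and is
# quantised to 4-bit NF4 while it is being loaded. No calibration data set is needed.
model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```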

Factors to Compare:

  • Speed - faster with GGUF and AWQ.
  • Fine-tuning - more straightforward with Bits and Bytes, possible with GPTQ and GGUF, not yet available with AWQ.
  • Merging adapters - challenging in all approaches.
  • Saving the model in quantised format - straightforward with GGUF, GPTQ, and AWQ; not feasible with Bits and Bytes.

AWQ Quantisation Steps:

  • Loading the model.
  • Installing the development version of Transformers.
  • Running AutoAWQ for quantisation; using the safetensors format is recommended (a sketch follows this list).
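Here is a minimal AutoAWQ sketch, assuming the `autoawq` package and its `AutoAWQForCausalLM` API. The model and output paths are placeholders, the `quant_config` values mirror the settings typically shown in AutoAWQ’s examples, and the `safetensors=True` argument is an assumption based on those examples.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths; substitute the model you actually want to quantise.
model_path = "mistralai/Mistral-7B-v0.1"
quant_path = "mistral-7b-awq"

# 4-bit AWQ with a group size of 128 (values commonly used in AutoAWQ examples).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AutoAWQ runs a calibration data set through the model to find the salient activations.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantised weights (safetensors format) along with the tokenizer.
model.save_quantized(quant_path, safetensors=True)
tokenizer.save_pretrained(quant_path)
```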

Methods with respect to pre- and post-training:

  • Quantisation-aware training (QAT) - quantisation is simulated while the model is being trained.
  • Post-training quantisation (PTQ) - an already trained model is quantised afterwards; AWQ, GPTQ, GGUF, and Bits and Bytes all fall into this category (a toy contrast follows this list).
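A toy contrast between the two, sketched in NumPy with a hypothetical `fake_quant` helper. Real QAT needs framework support (for example, straight-through estimators so gradients can flow through the rounding step); this sketch only shows where the quantisation step sits relative to training.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Hypothetical helper: symmetric round-to-nearest quantise, then dequantise."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

inputs = np.random.randn(4).astype(np.float32)
trained_weights = np.random.randn(4).astype(np.float32)

# Post-training quantisation (PTQ): train in full precision first,
# then quantise the finished weights once before deployment.
ptq_weights = fake_quant(trained_weights)

# Quantisation-aware training (QAT): the forward pass already sees quantised
# weights during training, so the optimiser learns to compensate for rounding error.
def qat_forward(weights, x):
    return fake_quant(weights) @ x

print("PTQ weights:", ptq_weights)
print("QAT-style forward output:", qat_forward(trained_weights, inputs))
```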

Questions, thoughts, or experiences with LLM quantisation? Stay tuned!