1-Bit Quantization: Is It Lit or Mid?

Exploring the next generation of large language models

A Large Language Model (LLM) can generate tokens for any number of instances. Nowadays LLM use cases have multiplied, and with them the model sizes and system requirements. A lot of research has gone into fixing this, and as a result we now have LLMs whose memory format has been reduced from 32-bit floating point (FP32) to 8-bit floating point (FP8). This process is known as quantization. It has had a real impact on the field: after it, anyone can run an LLM on their local device without interruption. Recently, one line of research has been making waves, so let's discuss it. The paper in question claims that a capable LLM can be built with a 1-bit quantized memory format.

I read the research paper in November 2024. Here are my insights on it.
Reference: Click here
Recently, Microsoft published another paper in which they use 4-bit activations to build an LLM.
Reference: Click here

QUANTIZATION

What is Quantization?

In quantization, a higher-precision memory format is converted into a lower-precision one. Choosing the value range for this conversion is known as calibration. Quantization scales down the memory footprint of the model, which makes inference faster and fine-tuning cheaper. Some information is lost in the conversion, however, which is why a quantized model's accuracy can drop.
There are two types of calibration:
1. Symmetric (e.g., absmax scaling)
2. Asymmetric

Symmetric Quantization:

Symmetric quantization is a method of reducing the size of a model by converting its weights and activations from high-precision floating-point values to lower-precision integers, where the range of quantized values is distributed evenly around zero.
[0, 1, 2, 3, 4, …, 100] → weights → 32 bits (float values)
[0, 1, 2, 3, 4, …, 255] → weights → 8 bits (int values)
Say I have a model whose weights are given in 32-bit format, and I want to quantize it from 32 bits to uint8, i.e., into the range [0, 255].

Here I used a min-max scaler to map the values down into the new memory format. The representation looks like this:
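As a minimal sketch of this idea (my own illustration, not the paper's code), here is a symmetric, zero-centred mapping from FP32 weights to signed 8-bit integers in NumPy. The absmax scaling shown is one common way to calibrate the range, and the function names are just for this example:

```python
import numpy as np

def symmetric_quantize(weights: np.ndarray, n_bits: int = 8):
    """Symmetric (zero-centred) quantization: scale by the max absolute value."""
    qmax = 2 ** (n_bits - 1) - 1                     # e.g. 127 for int8
    scale = np.max(np.abs(weights)) / qmax           # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 values back to (approximate) float32."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)         # toy FP32 weight matrix
q, scale = symmetric_quantize(w)
print(np.abs(w - dequantize(q, scale)).max())         # small round-trip error
```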

Asymmetric Quantization:

Asymmetric quantization maps the floating-point numbers from [β, α] into [0, 2ⁿ − 1], where n is the number of bits in the quantized version (for example, if n = 8, the floating-point numbers will be represented in the [0, 255] range), and β and α are the minimum and the maximum floating-point numbers in the original tensor.
Symmetric quantization is simpler and more computationally efficient than asymmetric quantization, but it doesn't work as well with skewed data distributions (such as activations) and generally results in lower accuracy. The two types can be combined: activations and inputs are quantized in the asymmetric setting, and the weights in the symmetric one.
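To make the mapping concrete, here is a small NumPy sketch of asymmetric quantization with a scale and a zero point (again mine, not from the paper; the names are illustrative):

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, n_bits: int = 8):
    """Map floats in [beta, alpha] onto the unsigned range [0, 2**n_bits - 1]."""
    beta, alpha = x.min(), x.max()             # min and max of the original tensor
    qmax = 2 ** n_bits - 1                     # 255 for n_bits = 8
    scale = (alpha - beta) / qmax
    zero_point = int(np.round(-beta / scale))  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

acts = np.random.randn(1024).astype(np.float32) ** 2   # skewed, activation-like data
q, scale, zp = asymmetric_quantize(acts)
print(np.abs(acts - asymmetric_dequantize(q, scale, zp)).max())
```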

BITNET, 1-Bit Quantization

In BitNet, quantization is pushed further: the weights are re-quantized down to 1 bit, i.e., each weight becomes a single binary value (+1 or −1 via a sign function). This process is also known as binarization. It makes the model run faster and shrinks its size, although the compression may affect performance.
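To make the binarization concrete, here is a rough PyTorch sketch in the spirit of BitNet's weight binarization as the paper describes it (centre the weights by their mean, take the sign, and keep a scalar β to rescale outputs). Treat it as an illustration, not the exact implementation:

```python
import torch

def binarize_weights(w: torch.Tensor):
    """Binarize a weight matrix to roughly {-1, +1}, BitNet-style (sketch only):
    centre the weights by their mean, then take the sign. The scalar beta
    (mean absolute value) can be used to rescale the layer output back to
    the original magnitude."""
    alpha = w.mean()                # centring term
    beta = w.abs().mean()           # per-tensor scaling factor
    w_bin = torch.sign(w - alpha)   # entries in {-1, +1} (0 only for exact ties)
    return w_bin, beta

w = torch.randn(256, 256)
w_bin, beta = binarize_weights(w)
# a binarized linear layer would then compute roughly: y ≈ (x @ w_bin.t()) * beta
```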
There are big advantages to quantizing LLMs this far. The first, of course, is speed and space. The second is that the statistical properties stay constant.

During the training of 8-bit or 4-bit LLMs, the variance of the weights and activations changes over the course of training, so some statistical drift occurs. In 1-bit LLMs, however, the product matrix is used as a proxy for the statistical properties, so the product matrix should have the same variance before and after quantization.

Another advantage is that backpropagation happens in high precision. This ensures that small changes to the model's output are accurately captured and reflected in the updated weights.

Training the model:

  • Straight-through estimator (STE):
    When training a model with 1-bit weights, some parts of the process are not differentiable (e.g., the "Sign" and "Clip" functions), which means we can't directly compute gradients for them. The straight-through estimator is a trick that lets us bypass these operations during backpropagation and still allow the gradients to flow through the model (see the sketch at the end of this section). It helps us train the model even though certain operations aren't easily differentiable.

  • Mixed precision training:
    To make training more efficient, the model's weights and activations are stored in lower precision (like 1-bit), but the gradients (which help in adjusting the weights) and optimizer states are kept in higher precision (like float32). This helps maintain the stability and accuracy of the training process. They also keep a "latent weight" in high precision for updates, but only use the low-precision (1-bit) weights during the actual forward pass and inference (the sketch at the end of this section keeps such a latent weight).

  • Large learning rate:
    With 1-bit weights, small updates to the weights often don’t make much of a difference, which can slow down training. To speed up the process, the team used a large learning rate. This helped their model (called BitNet) converge faster during training. However, they found that using the same large learning rate with a normal floating-point model (like FP16) caused it to "diverge" or fail to train properly, which shows that BitNet handles the large learning rate better.

    Here's the performance index for LLMs across all the memory formats.
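Putting the straight-through estimator and the mixed-precision idea together, here is a toy PyTorch sketch (my own illustration, not BitNet's code) of a 1-bit linear layer that keeps a full-precision latent weight and passes gradients straight through the sign function:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign in the forward pass; straight-through (identity) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # pass the gradient straight through, clipped to the region |w| <= 1
        return grad_output * (w.abs() <= 1).float()

class BitLinear(torch.nn.Module):
    """Toy 1-bit linear layer: the latent weight stays in float32,
    only the forward pass sees the binarized copy."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)   # {-1, +1} weights
        beta = self.weight.abs().mean()          # rescale to the original magnitude
        return torch.nn.functional.linear(x, w_bin) * beta

layer = BitLinear(16, 4)
opt = torch.optim.Adam(layer.parameters(), lr=1e-1)   # note the relatively large learning rate
x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = ((layer(x) - y) ** 2).mean()
loss.backward()    # gradients flow through the STE in full precision
opt.step()         # the full-precision latent weights get updated
```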

Conclusion:

BitNet is a new type of Transformer model that uses 1-bit weights, making it more efficient for large language models. It’s designed to be scalable (it can grow larger as needed) and stable. The results show that BitNet performs well, with low perplexity and good results on downstream tasks, while using much less memory and energy than traditional models. BitNet also follows a similar growth pattern to full-precision Transformers, meaning it can be scaled up to even larger models with potential benefits in both performance and efficiency. In the future, the team plans to make BitNet even larger and explore using it in other types of models, like RetNet, for training big language models.