Static quantization of activations for transformers


I am currently using a near-SOTA technique for quantizing the weights of large language models such as GPT and LLaMA 2. The technique is W4A16: weights are quantized to 4 bits, while activations are kept in FP16. I would like to further quantize the activations to 8 bits to reduce the memory footprint.

How do I go about this?
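For context, static activation quantization generally means running a small calibration set through the model, recording the activation ranges, and freezing a per-tensor scale for use at inference. Here is a minimal NumPy sketch of the symmetric int8 version of that idea — function names like `calibrate_scale` are purely illustrative, not part of any library API:

```python
import numpy as np

def calibrate_scale(calibration_activations, num_bits=8):
    # Fixed (static) symmetric scale derived from the observed range.
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(float(np.abs(a).max()) for a in calibration_activations)
    return max_abs / qmax

def quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Offline calibration pass over representative activations.
calib = [np.linspace(-2.0, 2.0, 100, dtype=np.float32)]
scale = calibrate_scale(calib)

# At inference time the scale is fixed; no per-batch statistics needed.
x = np.array([0.5, -1.2, 0.03], dtype=np.float32)
x_hat = dequantize(quantize(x, scale), scale)
print(np.abs(x - x_hat).max() <= scale / 2)  # True: in-range error is at most half a step
```

The trade-off versus dynamic quantization is that a static scale adds no runtime overhead, but activations that fall outside the calibrated range get clipped.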


Hi @abhinavkulkarni, Optimum cannot perform W4A8 quantization.
It seems that recent quantization techniques are either W4A16 or W8A8, so I suspect W4A8 comes with a significant drop in accuracy.

@abhinavkulkarni You may be interested in this work: [2305.17888] LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
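For a rough sense of what quantization-aware training involves: the low-bit numerics are simulated during training with a quantize-dequantize ("fake quant") op, and gradients flow through it via the straight-through estimator. A toy NumPy sketch of that forward-pass step (illustrative only, not the paper's exact scheme):

```python
import numpy as np

def fake_quant(x, num_bits=4):
    # Quantize then immediately dequantize: the tensor stays fp32,
    # but only takes values representable on the num_bits grid.
    qmax = 2 ** (num_bits - 1) - 1  # 7 for symmetric 4-bit
    scale = float(np.abs(x).max()) / qmax
    return np.round(x / scale) * scale

w = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
w_q = fake_quant(w, num_bits=4)
# w_q has at most 2**num_bits distinct values; rounding error <= scale/2.
print(len(np.unique(w_q)) <= 2 ** 4)  # True
```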