Static quantization of activations for transformers

Hi,

I am currently using a near-SOTA technique for quantizing the weights of large language models such as GPT and LLaMA 2. The technique is W4A16: weights are quantized to 4 bits, while activations are kept in fp16. I would like to further quantize the activations to 8 bits to reduce the memory footprint.
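To make the question concrete, here is roughly what I have in mind for the activation side: calibrate a per-tensor range offline, derive an INT8 scale/zero-point, and (fake-)quantize the layer inputs at inference time. This is only an illustrative sketch with helper names I made up (nothing here is an existing Optimum or AWQ API), and it assumes the 4-bit weight part is already handled by the existing kernels:

```python
# Rough sketch of static per-tensor INT8 activation quantization.
# All helper names below are my own illustration, not a library API.
import torch
import torch.nn as nn


def calibrate_activation_range(model, layer, calib_batches):
    """Record min/max of a layer's input activations over calibration data."""
    stats = {}

    def hook(_module, inputs, _output):
        x = inputs[0].detach().float()
        stats["lo"] = min(stats.get("lo", float("inf")), x.min().item())
        stats["hi"] = max(stats.get("hi", float("-inf")), x.max().item())

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        for batch in calib_batches:
            model(batch)
    handle.remove()
    return stats["lo"], stats["hi"]


def int8_qparams(lo, hi):
    """Asymmetric per-tensor scale/zero-point for INT8 (qmin=-128, qmax=127)."""
    qmin, qmax = -128, 127
    scale = max(hi - lo, 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def fake_quant_int8(x, scale, zero_point):
    """Quantize to INT8 and dequantize back, simulating A8 inference."""
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale


# Example usage on a single nn.Linear (stand-in for a W4-quantized layer):
layer = nn.Linear(16, 16)
model = nn.Sequential(layer)
calib = [torch.randn(4, 16) for _ in range(8)]
lo, hi = calibrate_activation_range(model, layer, calib)
scale, zp = int8_qparams(lo, hi)
x_q = fake_quant_int8(torch.randn(4, 16), scale, zp)
```

Of course, this only simulates A8 numerically; getting an actual memory/speed benefit would require fused W4A8 kernels, which is really what I'm asking about.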

How do I go about this?

Thanks!

Hi @abhinavkulkarni, Optimum cannot perform W4A8 quantization.
It seems that recent quantization techniques are either W4A16 or W8A8, so I suspect W4A8 comes with a big drop in accuracy.

@abhinavkulkarni You may be interested in this work: [2305.17888] LLM-QAT: Data-Free Quantization Aware Training for Large Language Models