Hi,
I am currently using a near-SOTA technique for quantizing the weights of large language models such as GPT and LLaMA 2. The scheme is W4A16: weights are quantized to 4 bits, while activations are kept in fp16. I would now like to quantize the activations to 8 bits as well (i.e. W4A8) to further reduce the memory footprint.
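To make it concrete, here is a minimal sketch of what I have in mind: dynamic, symmetric per-token int8 "fake quantization" of the activations that feed each quantized linear layer. The function and class names below are just my own illustration for experimentation, not from any particular library.

```python
import torch

def fake_quant_activations_int8(x: torch.Tensor) -> torch.Tensor:
    # x: [batch * seq_len, hidden_dim] activations in fp16/fp32.
    # Per-token scale: map each row's max absolute value to the int8 range.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    # Quantize to int8 and immediately dequantize (fake quant), so the rest
    # of the fp16 pipeline keeps working while I measure the accuracy impact.
    x_q = torch.clamp(torch.round(x / scale), -128, 127)
    return (x_q * scale).to(x.dtype)

class A8Linear(torch.nn.Module):
    """Wraps an existing linear layer so its inputs are int8-quantized on the fly."""
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        self.linear = linear  # weights assumed already W4-quantized elsewhere

    def forward(self, x):
        return self.linear(fake_quant_activations_int8(x))
```

Is something along these lines a reasonable starting point, or is there a better-supported way to get real A8 kernels rather than simulated quantization?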
How do I go about this?
Thanks!