Hello,
I have benchmarked Mistral against Llama 3 at fp16, 4-bit, and 8-bit quantisation.
For both models, 8-bit is dramatically slower than fp16 and 4-bit. On top of that, Mistral is significantly slower than Llama 3 for inference across the board.
I use the “generate” method rather than pipelines, since the docs say pipelines are not optimised for 8-bit.
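For context, my setup looks roughly like this (the model IDs, prompt, and generation parameters below are illustrative rather than my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # likewise "meta-llama/Meta-Llama-3-8B-Instruct"

# 8-bit loading via bitsandbytes; for the 4-bit run I set load_in_4bit=True
# instead, and for fp16 I drop quantization_config and pass torch_dtype=torch.float16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Timing is measured around this generate call.
inputs = tokenizer("Explain quantisation in one sentence.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```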
Can someone please explain why 8-bit is still slow for inference, and why it is especially slow on Mistral?