Hello,
I have benchmarked Mistral against Llama 3 at fp16, 4-bit, and 8-bit quantisation.
For both models, 8-bit is dramatically slower than fp16 and 4-bit. On top of that, Mistral is significantly slower than Llama 3 for inference across the board.
I use the “generate” method rather than pipelines, since the docs say pipelines are not optimised for 8-bit.
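For context, my setup looks roughly like this (the model IDs, prompt, and generation parameters below are illustrative rather than my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # likewise "meta-llama/Meta-Llama-3-8B-Instruct"

# 8-bit loading via bitsandbytes; for the 4-bit run I set load_in_4bit=True
# instead, and for fp16 I drop quantization_config and pass torch_dtype=torch.float16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Timing is measured around this generate call.
inputs = tokenizer("Explain quantisation in one sentence.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```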
Can someone please explain why 8-bit is still slow for inference, and why it is especially slow on Mistral?