Hugging Face Forums
Enabling load_in_8bit makes inference much slower
🤗 Transformers
chaochaoli
September 9, 2023, 7:48am
Me too! And training is slower too.
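The slowdowns reported in this thread are consistent with how `load_in_8bit` works: weights are stored as int8 and must be scaled back to floating point around every matmul, so memory goes down but each forward pass does extra work. Below is a minimal numpy sketch of per-column absmax int8 quantization and dequantization; it is illustrative only, not the actual bitsandbytes LLM.int8() kernel.

```python
import numpy as np

def quantize_absmax(w):
    # Per-column absmax scaling into the int8 range [-127, 127].
    # Illustrative sketch, not the bitsandbytes implementation.
    scale = np.abs(w).max(axis=0) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # This extra scaling step runs around every quantized matmul,
    # which is one source of the slowdown people observe.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)

q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

print("int8 storage:", q.nbytes, "bytes vs fp32:", w.nbytes, "bytes")
print("max reconstruction error:", np.abs(w - w_hat).max())
```

The reconstruction error stays small, but unless the hardware and kernels are tuned for int8, the quantize/dequantize round trips can easily cost more than the fp16 matmuls they replace, which is why 8-bit loading often trades speed for memory rather than improving both.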