Enabling load_in_8bit makes inference much slower

I loaded the 7B LLaMA model on an A100 like this:

import torch
import transformers
from transformers import LlamaForCausalLM

quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=False, llm_int8_threshold=0.0)
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
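For the 8-bit comparison, the only thing I change is the flag. A minimal sketch of that second run (the variable names here are just for illustration):

# 8-bit variant used for the comparison below; only load_in_8bit changes.
quantization_config_8bit = transformers.BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=0.0
)
model_8bit = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config_8bit,
)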

With load_in_8bit=False it generates 16.7 tokens/sec, whereas with load_in_8bit=True it generates only 6.7 tokens/sec. I suspect I have misconfigured something for the 8-bit path.
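In case it matters, this is roughly how I measure throughput (a minimal sketch; the prompt and max_new_tokens are placeholders I picked for illustration):

import time
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

# tokens/sec = newly generated tokens divided by wall-clock time
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")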

My transformers version is 4.29.0.dev0. Did I miss anything? Thanks.


I am facing a similar issue. Did you find a solution?

Me too, and training is slower as well.

Same here. Not sure if anyone has found the root cause.