Some questions about GPT-J inference using int8

Currently, Hugging Face transformers supports loading models in int8, which saves a lot of GPU VRAM.

I’ve tried it with GPT-J, but found that inference in int8 is much slower, roughly 8x slower than in the usual float16.
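For context, here is a minimal sketch of the kind of comparison I mean (assuming the bitsandbytes-backed `load_in_8bit` flag in `from_pretrained`; the checkpoint id, prompt, and token count are just placeholders, and it needs `bitsandbytes` and `accelerate` installed plus enough VRAM to hold both copies):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# float16 baseline on GPU
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# int8 via the bitsandbytes integration
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

def time_generate(model):
    # Time a single generate() call, synchronizing so GPU work is included
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()
    return time.time() - start

print("fp16:", time_generate(model_fp16))
print("int8:", time_generate(model_int8))
```

With something like this, the int8 model comes out far slower per generated token than the float16 one.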

Can somebody tell me why this happens and how I can solve it?