Hi all,
Recently I was experimenting with LLM inference speed and ran into an issue that I don't know how to solve, or whether the behaviour is even expected. I was running the experiments on a 40GB A100.
When I was loading the GPT-NeoX-20B model via:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    load_in_8bit=True,
    device_map="auto",
)
the inference was quite slow: roughly 14 seconds for 50 tokens, while consuming around 24GB of VRAM.
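For reference, this is roughly how I timed it (a sketch, not my exact script; the prompt is just a throwaway example and model is the object loaded above):

import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# Throwaway prompt, only used to time generation of 50 new tokens
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
torch.cuda.synchronize()
print(f"50 tokens in {time.time() - start:.1f} s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")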
Then I read this: Memory Decreases! But Latency Increases.... · Issue #6 · TimDettmers/bitsandbytes · GitHub.
There, in order to increase inference speed, they propose setting model = AutoModelForCausalLM.from_pretrained(..., int8_threshold=0), but according to this issue the argument seems to have been renamed since then: __init__() got an unexpected keyword argument 'int8_threshold' · Issue #21877 · huggingface/transformers · GitHub.
But when I set BitsAndBytesConfig(llm_int8_threshold=0.0), the model consumes around 40GB of VRAM and inference runs out of memory. Am I using it wrong, or how can I find the correct value? I tried multiple values in steps up to 6.0, but the model consumes 40GB at every value, i.e. as much memory as with fp16 weights. Am I missing something?
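For what it's worth, the sweep looked roughly like this (a sketch, not my exact script; the threshold list is just the steps I tried, and get_memory_footprint() is only there as a quick check):

import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

for threshold in [0.0, 1.0, 2.0, 4.0, 6.0]:
    quantization_config = BitsAndBytesConfig(llm_int8_threshold=threshold)
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-neox-20b",
        load_in_8bit=True,
        device_map="auto",
        quantization_config=quantization_config,
    )
    # Every threshold reports a roughly fp16-sized footprint (~40GB),
    # matching what nvidia-smi shows
    print(threshold, f"{model.get_memory_footprint() / 1024**3:.1f} GiB")
    del model
    gc.collect()
    torch.cuda.empty_cache()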
Ultimately, I want to increase the inference speed of GPT-NeoX so every hint on how to accomplish that is highly appreciated!
Hi @agademic
Thanks for the issue!
Can you try with a relatively large value, for example llm_int8_threshold=200, and let us know if that helps?
Hi @ybelkada,
Thank you for the quick reply. Unfortunately, with BitsAndBytesConfig(llm_int8_threshold=200.0) the model still consumes ~40GB.
Here’s the code:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    # torch_dtype=torch.float16,
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)
And here’s the nvidia-smi report:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |  38609MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5608      C   /opt/conda/bin/python3          38607MiB |
+-----------------------------------------------------------------------------+
The documentation of BitsAndBytesConfig says: "This replaces load_in_8bit therefore both options are mutually exclusive."
So it appears that specifying load_in_8bit in .from_pretrained() no longer has any effect once you specify quantization_config.
I have tried the following and it correctly uses only 20GB of VRAM:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=200.0,
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)
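If it helps, here is a quick sanity check I would run from Python (the prompt is just an example): it confirms the ~20GB footprint and lets you time generation.

import time
from transformers import AutoTokenizer

# Roughly 20 GiB for the int8 weights, matching nvidia-smi
print(f"footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)

start = time.time()
output = model.generate(**inputs, max_new_tokens=50)
print(f"50 new tokens in {time.time() - start:.1f} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))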