Correct Usage of BitsAndBytesConfig

Hi all,

Recently I was experimenting with inference speed for LLMs and encountered an issue that I don't know how to solve, nor whether its behaviour is expected. I was running the experiments on a 40GB A100.

When I was loading the GPT-NeoX-20B model via:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    load_in_8bit=True,  # int8 quantization via bitsandbytes
    device_map="auto",
)

the inference was quite slow: ~14 seconds for 50 tokens, while consuming around 24GB of VRAM.
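For reference, here is roughly how the timing can be measured (a minimal sketch; the prompt and the exact timing code are illustrative, not the original benchmark):

import time

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=50)
print(f"generated 50 tokens in {time.perf_counter() - start:.1f}s")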

Then I read this: Memory Decreases! But Latency Increases.... · Issue #6 · TimDettmers/bitsandbytes · GitHub.
There, to increase inference speed, they propose setting model = AutoModelForCausalLM.from_pretrained(..., int8_threshold=0), but according to this issue that argument has since been renamed: __init__() got an unexpected keyword argument 'int8_threshold' · Issue #21877 · huggingface/transformers · GitHub.

But when I set BitsAndBytesConfig(llm_int8_threshold=0.0), the model consumes around 40GB of VRAM and inference runs OOM. Am I using it wrong, and how can I find the correct value? I tried multiple values in steps up to 6.0, but the model consumes 40GB at every value, i.e. as much memory as with fp16 weights. Am I missing something?
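The sweep looked roughly like this (a sketch; the exact step values are illustrative, and note that load_in_8bit is still passed to from_pretrained() here, as in my original attempts):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

for threshold in [0.0, 2.0, 4.0, 6.0]:
    quantization_config = BitsAndBytesConfig(llm_int8_threshold=threshold)
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-neox-20b",
        load_in_8bit=True,
        device_map="auto",
        quantization_config=quantization_config,
    )
    # each of these runs consumed ~40GB, the same as fp16 weights
    del model
    torch.cuda.empty_cache()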

Ultimately, I want to increase the inference speed of GPT-NeoX, so every hint on how to accomplish that is highly appreciated!


cc @ybelkada

Hi @agademic
Thanks for the issue!
Can you try with a relatively large value, for example llm_int8_threshold=200, and let us know if that helps?

Hi @ybelkada,

Thank you for the quick reply. Unfortunately, with BitsAndBytesConfig(llm_int8_threshold=200.0) the model still consumes ~40GB.

Here’s the code:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    #torch_dtype=torch.float16,
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)

And here’s the nvidia-smi report:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |  38609MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5608      C   /opt/conda/bin/python3          38607MiB |
+-----------------------------------------------------------------------------+

The documentation of BitsAndBytesConfig says:

This replaces load_in_8bit therefore both options are mutually exclusive.

So it appears that specifying load_in_8bit in .from_pretrained() no longer has any effect once you specify quantization_config.

I have tried the following, and it correctly uses only ~20GB of VRAM:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# note: load_in_8bit is set inside BitsAndBytesConfig, not passed to from_pretrained()
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)
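
A quick sanity check after loading (a sketch; get_memory_footprint() is a standard transformers model method, the print formatting is mine):

# ~20GB here confirms the weights were actually loaded in int8
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")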