Correct Usage of BitsAndBytesConfig

Hi @ybelkada,

Thank you for the quick reply. Unfortunately, with BitsAndBytesConfig(llm_int8_threshold=200.0) the model still consumes ~40 GB.

Here’s the code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    # torch_dtype=torch.float16,
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)

And here’s the nvidia-smi report:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |  38609MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5608      C   /opt/conda/bin/python3          38607MiB |
+-----------------------------------------------------------------------------+
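As a rough sanity check (my own back-of-envelope, assuming ~20.6B parameters for GPT-NeoX-20B, which is an assumption on my part): fp16 weights alone would be ~38 GiB, while genuinely int8 weights should be roughly half that, so the usage above looks like the weights are still held in 16-bit.

```python
# Back-of-envelope estimate (not from the docs): memory for weights only,
# assuming ~20.6B parameters for GPT-NeoX-20B.
params = 20.6e9
gib = 2**30

bytes_fp16 = params * 2  # 2 bytes per parameter in float16
bytes_int8 = params * 1  # 1 byte per parameter with 8-bit weights

print(f"fp16 weights: ~{bytes_fp16 / gib:.1f} GiB")  # → ~38.4 GiB
print(f"int8 weights: ~{bytes_int8 / gib:.1f} GiB")  # → ~19.2 GiB
```

The ~38.4 GiB fp16 figure matches the 38609 MiB that nvidia-smi reports, which is what makes me suspect the 8-bit path isn't being applied.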