Hi @ybelkada,
Thank you for the quick reply. Unfortunately, even with BitsAndBytesConfig(llm_int8_threshold=200.0) the model still consumes ~40 GB.
Here’s the code:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    # torch_dtype=torch.float16,
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)
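As a sanity check, this is how I try to confirm whether the int8 conversion actually happened (a minimal sketch; I'm assuming model.get_memory_footprint() exists in this transformers version and that bitsandbytes is importable):

import bitsandbytes as bnb

# Footprint of the loaded weights; if int8 conversion succeeded, this should
# be roughly half the fp16 size (~20 GB for gpt-neox-20b).
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")

# Count the linear layers that were replaced by bitsandbytes int8 modules.
int8_layers = [name for name, module in model.named_modules()
               if isinstance(module, bnb.nn.Linear8bitLt)]
print(f"{len(int8_layers)} Linear8bitLt layers")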
And here’s the nvidia-smi report:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |  38609MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5608      C   /opt/conda/bin/python3          38607MiB |
+-----------------------------------------------------------------------------+
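For completeness, the usage can also be read from inside the process (a minimal cross-check using the standard torch.cuda counters):

import torch

# Should roughly match the nvidia-smi figure above.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")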