Hi @ybelkada,
Thank you for the quick reply. Unfortunately, even with BitsAndBytesConfig(llm_int8_threshold=200.0) the model still consumes ~40 GB.
Here’s the code:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_threshold=200.0)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    # torch_dtype=torch.float16,
    load_in_8bit=True,
    device_map="auto",
    quantization_config=quantization_config,
)
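As a sanity check, this is how I try to confirm whether the int8 conversion actually happened (a minimal sketch; I'm assuming model.get_memory_footprint() exists in this transformers version and that bitsandbytes is importable):

import bitsandbytes as bnb

# Footprint of the loaded weights; if int8 conversion succeeded, this should
# be roughly half the fp16 size (~20 GB for gpt-neox-20b).
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")

# Count the linear layers that were replaced by bitsandbytes int8 modules.
int8_layers = [name for name, module in model.named_modules()
               if isinstance(module, bnb.nn.Linear8bitLt)]
print(f"{len(int8_layers)} Linear8bitLt layers")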
And here’s the nvidia-smi report:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |  38609MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5608      C   /opt/conda/bin/python3          38607MiB |
+-----------------------------------------------------------------------------+
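For completeness, the usage can also be read from inside the process (a minimal cross-check using the standard torch.cuda counters):

import torch

# Should roughly match the nvidia-smi figure above.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")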