CUDA out of memory on multi-GPU

Hi,

I’m training Llama 2 7B ("meta-llama/Llama-2-7b-hf") on a ml.g5.24xlarge (96 GiB of GPU memory split across 4 GPUs). I’m using 4-bit quantization and a batch size of 1. It was working fine until recently, but it suddenly started throwing CUDA out-of-memory errors without me changing (as far as I know!) any meaningful parameter.

Can anyone help? I’m posting it here, but let me know if there’s a better place (or if it’s a PyTorch issue!).
Thanks

Error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 8.58 GiB (GPU 0; 22.19 GiB total capacity; 11.81 GiB already allocated; 6.73 GiB free; 14.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
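
(Side note: I haven’t tried the allocator hint from the error yet. As I understand it, PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation, roughly like this; 128 is only an illustrative value, not something from my setup:)

import os

# Must run before torch allocates anything on the GPU (e.g. at the very top of the script).
# "max_split_size_mb:128" is just an example value; see the PyTorch memory-management docs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"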

Code extract (note that a CustomTrainer subclass is used to make a slight modification to the loss function):
trainer = CustomTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    callbacks=[CustomCallback],
    args=CustomTrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=output_dir,
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",
        eval_accumulation_steps=4,
        resume_from_checkpoint=resume_from_checkpoint,
        report_to="wandb",
        run_name=wandb_run_name,  # name of the W&B run (optional)
    ),
    data_collator=CustomDataCollator(tokenizer, mlm=False),
)
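
For completeness: the model-loading code isn’t shown above. It follows the usual 4-bit bitsandbytes pattern, roughly like the sketch below (this is a generic sketch rather than my exact code; the quantization settings may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Generic 4-bit load; device_map="auto" shards the layers across the 4 A10Gs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")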

!nvidia-smi returns:
Wed Aug 30 09:01:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   35C    P0    62W / 300W |  15828MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   37C    P0    60W / 300W |  21496MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
|  0%   36C    P0    58W / 300W |  21500MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   35C    P0    60W / 300W |  21382MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    15826MiB |
|    1   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21494MiB |
|    2   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21498MiB |
|    3   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21380MiB |
+-----------------------------------------------------------------------------+


Did you solve this?
I have had the same issue with multiple GPUs. I see that your GPU usage is also quite high considering the model size, and the same happened in my case.