CUDA out of memory on multi-GPU

Hi,

I’m training Llama 2 7B ("meta-llama/Llama-2-7b-hf") on a ml.g5.24xlarge (96 GiB of GPU memory split across 4 GPUs). I’m using 4-bit quantization and a batch size of 1. It was working fine until recently, but it suddenly started throwing CUDA out-of-memory errors without me changing (as far as I know!) any meaningful parameter.

Can anyone help? I’m posting it here, but let me know if there’s a better place (or if it’s a PyTorch issue!).
Thanks

Error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 8.58 GiB (GPU 0; 22.19 GiB total capacity; 11.81 GiB already allocated; 6.73 GiB free; 14.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
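
(Side note: I haven’t tried the allocator hint from the error yet. As I understand it, PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation, roughly like this; 128 is only an illustrative value, not something from my setup:)

import os

# Must run before torch allocates anything on the GPU (e.g. at the very top of the script).
# "max_split_size_mb:128" is just an example value; see the PyTorch memory-management docs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"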

Code extract (note that a CustomTrainer subclass is used to make a slight modification to the loss function):
trainer = CustomTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    callbacks=[CustomCallback],
    args=CustomTrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=output_dir,
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",
        eval_accumulation_steps=4,
        resume_from_checkpoint=resume_from_checkpoint,
        report_to="wandb",
        run_name=wandb_run_name,  # name of the W&B run (optional)
    ),
    data_collator=CustomDataCollator(tokenizer, mlm=False),
)
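
For completeness: the model-loading code isn’t shown above. It follows the usual 4-bit bitsandbytes pattern, roughly like the sketch below (this is a generic sketch rather than my exact code; the quantization settings may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Generic 4-bit load; device_map="auto" shards the layers across the 4 A10Gs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")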

!nvidia-smi returns:
Wed Aug 30 09:01:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   35C    P0    62W / 300W |  15828MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   37C    P0    60W / 300W |  21496MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
|  0%   36C    P0    58W / 300W |  21500MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   35C    P0    60W / 300W |  21382MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    15826MiB |
|    1   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21494MiB |
|    2   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21498MiB |
|    3   N/A  N/A     21712      C   ...s/pytorch_p310/bin/python    21380MiB |
+-----------------------------------------------------------------------------+


Did you solve this?
I have had the same issue with multiple GPUs. I see that your GPU usage is also quite high considering the model size, and the same happened in my case.