CUDA OOM while saving the model

I am trying to fine-tune FLAN-T5-XXL using PEFT's LoRA method.

Training details -

- dataset_size = 6k records
- instance_type = AWS ml.g5.16xlarge
- batch_size = 2
- gradient_accumulation_steps = 2
- learning_rate = 1e-3
- num_train_epochs = 1 (I want to raise this to 3 or more, but chose 1 to first check whether training completes at all)
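
For context, this is roughly how the setup looks in code. This is a minimal sketch: the 8-bit model loading and the LoRA hyperparameters are my assumptions (typical choices for fitting an 11B model on the single 24 GB A10G of a g5.16xlarge); only the values listed above come from the actual run.

```python
# Sketch of the fine-tuning setup described above. Everything marked as an
# assumption is illustrative and not taken from the actual run.
from transformers import AutoModelForSeq2SeqLM, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

# Assumption: FLAN-T5-XXL loaded in 8-bit so it fits on a single 24 GB GPU.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    load_in_8bit=True,
    device_map="auto",
)

# Assumption: a typical seq2seq LoRA configuration; r/alpha/dropout are examples.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# The values below match the post; output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="flan-t5-xxl-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    num_train_epochs=1,
)
```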

Training completes with this output -
{'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}

But I am getting a CUDA OOM error when saving the model via the trainer.save_model call.

Details of the error -

OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 22.19 GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
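
For reference, the allocator setting the message points to would be applied like this (a minimal sketch; 128 MB is just an example value, and it has to be set before the process initializes CUDA):

```python
# Allocator hint suggested by the error message itself. This must be set
# before CUDA is initialized, e.g. at the very top of the training script
# or in the shell that launches it. 128 is an example value, not a tuned one.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import torch only after the environment variable is set
```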

Could anyone help me sort this out?

Aastha

Hi @sgugger! Do you have any suggestions on how to solve this error?

Found a solution - CUDA OOM error while saving the model · Issue #16 · philschmid/deep-learning-pytorch-huggingface · GitHub
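
For anyone hitting the same error: a common workaround for this symptom is to save only the small LoRA adapter instead of having the Trainer serialize the full model. A minimal sketch, assuming the trainer and tokenizer objects from the setup above (whether this matches the linked issue's exact fix is something readers should verify there):

```python
# Workaround sketch: PEFT's save_pretrained on the wrapped model writes only
# the adapter weights (a few MB), avoiding a full-model state dict on the GPU.
# Not confirmed to be the exact fix from the linked issue.
import torch

torch.cuda.empty_cache()  # release cached allocator blocks before saving

# trainer.model is the PeftModel; this saves the adapter weights only.
trainer.model.save_pretrained("flan-t5-xxl-lora-adapter")
tokenizer.save_pretrained("flan-t5-xxl-lora-adapter")  # tokenizer assumed in scope
```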