I am trying to fine-tune FLAN-T5-XXL using PEFT's LoRA method.
Training details (a rough sketch of the training script follows this list):
- dataset size = 6k records
- instance_type = AWS's ml.g5.16xlarge
- batch_size = 2
- gradient_accumulation_steps = 2
- learning_rate = 1e-3
- num_train_epochs = 1 (I want to raise this to 3 or more later, but chose 1 first to check whether training completes at all)
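For context, here is roughly what my training script looks like. The LoRA rank/alpha, target modules, dtype, and dataset loading shown here are approximations, not the exact values from my run; the hyperparameters match the list above.

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so XXL fits on the 24 GiB A10G
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,            # example values, not necessarily the ones from my run
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],
)
model = get_peft_model(model, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-lora",      # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    num_train_epochs=1,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # my tokenized 6k-record dataset
)
trainer.train()
```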
Training completes with this output:
{'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}
But I hit a CUDA OOM at the very end, when saving the model via the trainer.save_model call.
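For reference, the failing call is just the standard save at the end of the run (the output path below is a placeholder):

```python
trainer.train()  # finishes fine (train_loss ~1.278)

# This is the call that raises the OOM:
trainer.save_model("flan-t5-xxl-lora")  # placeholder output path

# Untested idea: since this is a PEFT model, saving just the adapter
# weights might avoid touching the full base model:
# trainer.model.save_pretrained("flan-t5-xxl-lora")
```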
Details of the error:

"OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0;
22.19 GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF"
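Following the hint at the end of the message, one thing I could try is setting the allocator config before any CUDA memory is allocated (the 128 MiB value below is just an example, not a recommendation):

```python
import os

# Must run before torch initializes CUDA, i.e. at the very top of the
# script, before anything is moved to the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

But I am not sure whether fragmentation is really the issue here, or whether save_model genuinely needs more memory than training did.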
Could anyone help me sort this out?
Aastha