Repeated training runs out of GPU memory

Hi, I’m trying to train a Llama-2-13B model on several datasets in sequence (one training run per dataset) using the HF Trainer, but I’m running out of GPU memory after the first run.

Detailed problem description:

I train the model on one dataset with one Trainer instance, then create a new Trainer and train the same model on the next dataset, basically:

model = ...
for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

The first training run works fine, but the second training run runs out of GPU memory during the first training step:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.49 GiB. GPU 0 has a total capacity of 79.14 GiB of which 28.59 GiB is free. Process 1380018 has 50.29 GiB memory in use. Of the allocated memory 49.19 GiB is allocated by PyTorch, and 12.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Presumably there is some sort of memory leak.

torch.cuda.mem_get_info() shows that after the first training run, i.e. when the second run starts, about 27 GB of GPU memory is still allocated.
I suspect this is mostly the model parameters still sitting on the GPU (13B parameters × 2 bytes for bfloat16 ≈ 26 GB).
My assumption is that this should be fine, since I want the model to be on the GPU for the next training run anyway.
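For reference, I’m measuring this roughly as follows (sketch):

import torch

def log_gpu_memory(tag):
    # mem_get_info reports free/total device memory as seen by the CUDA driver;
    # memory_allocated/memory_reserved are PyTorch's own allocator counters
    free, total = torch.cuda.mem_get_info()
    print(
        f"[{tag}] used on device: {(total - free) / 2**30:.2f} GiB | "
        f"allocated by PyTorch: {torch.cuda.memory_allocated() / 2**30:.2f} GiB | "
        f"reserved by PyTorch: {torch.cuda.memory_reserved() / 2**30:.2f} GiB"
    )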

Solution attempts so far

One guess is that the trainer somehow leaks GPU memory.
I tried to delete it and free up memory as follows:

import gc
import torch

model = ...
for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

    # release everything the Trainer might still hold on to
    del trainer
    gc.collect()
    torch.cuda.empty_cache()

but I’m still getting the same OOM error.

I also tried moving the model to the CPU between training runs to free up memory via model = model.to("cpu"); torch.cuda.empty_cache(), but this didn’t change the memory behavior either.
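Concretely, the in-between step sits in the loop roughly like this (sketch, Trainer arguments omitted):

for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

    # try to free GPU memory before the next run
    model = model.to("cpu")
    torch.cuda.empty_cache()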

Deleting the model and reloading it from disk also does not seem to help:

del model
torch.cuda.empty_cache()
model = ... # reload

I also tried re-using the same Trainer instance, just changing its train_dataset attribute between runs, with the same OOM result.
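For reference, that attempt looked roughly like this (sketch):

trainer = Trainer(..., model=model, train_dataset=datasets[0])
for data in datasets:
    trainer.train_dataset = data  # swap the dataset on the existing Trainer
    trainer.train()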

My last, somewhat hacky attempt was to run each training run in its own subprocess, so that its resources are freed when the process terminates, and then to start the next run in a fresh process. However, with this approach the training loop gets stuck, presumably because of a deadlock.
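The subprocess variant was roughly along these lines (a sketch; the train_one helper is just how I structured it for illustration):

import multiprocessing as mp

def train_one(data):
    # load the model inside the child so no CUDA state is shared with the parent
    model = ...  # reload from disk
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

if __name__ == "__main__":
    # "spawn" rather than "fork", so the child does not inherit CUDA state
    ctx = mp.get_context("spawn")
    for data in datasets:
        p = ctx.Process(target=train_one, args=(data,))
        p.start()
        p.join()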

Additional information

I’m running on a single A100 with 80 GB of memory, using batch size 1 and DeepSpeed ZeRO stage 2 with CPU offloading.
All datasets have exactly the same size and sequence length.
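For completeness, the DeepSpeed/Trainer setup is roughly equivalent to the following (paraphrased; the exact config values may differ slightly):

from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # ZeRO stage 2 with optimizer CPU offload
)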

Questions

Has anyone encountered similar issues before or managed to make repeated training runs work?
Could the Trainer or the model be leaking memory?
Could the problem be DeepSpeed-related?
Thank you!