Repeated training runs out of GPU memory

Hi, I’m trying to train a Llama-2-13B model on several datasets in sequence (one training run per dataset) using the HF Trainer, but I’m running out of GPU memory after the first run.

Detailed problem description:

I train the model on one dataset with one Trainer instance, then create a new Trainer and train the same model on the next dataset, basically:

model = ...
for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

The first training run works fine, but the second training run runs out of GPU memory during the first training step:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.49 GiB. GPU 0 has a total capacity of 79.14 GiB of which 28.59 GiB is free. Process 1380018 has 50.29 GiB memory in use. Of the allocated memory 49.19 GiB is allocated by PyTorch, and 12.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Presumably there is some sort of memory leak.

torch.cuda.mem_get_info() shows that after the first training run, i.e. when the second run starts, about 27 GB of GPU memory is still allocated.
I suspect this is mostly the model parameters still sitting on the GPU (13B parameters × 2 bytes for bfloat16 ≈ 26 GB).
My assumption is that this should be fine, since I want the model to be on the GPU for the next training run anyway.
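For reference, I’m measuring this roughly as follows (sketch):

import torch

def log_gpu_memory(tag):
    # mem_get_info reports free/total device memory as seen by the CUDA driver;
    # memory_allocated/memory_reserved are PyTorch's own allocator counters
    free, total = torch.cuda.mem_get_info()
    print(
        f"[{tag}] used on device: {(total - free) / 2**30:.2f} GiB | "
        f"allocated by PyTorch: {torch.cuda.memory_allocated() / 2**30:.2f} GiB | "
        f"reserved by PyTorch: {torch.cuda.memory_reserved() / 2**30:.2f} GiB"
    )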

Solution attempts so far

One guess is that the trainer somehow leaks GPU memory.
I tried to delete it and free up memory as follows:

import gc
import torch

model = ...
for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

    # release everything the Trainer might still hold on to
    del trainer
    gc.collect()
    torch.cuda.empty_cache()

but I’m still getting the same OOM error.

I also tried moving the model to the CPU between training runs to free up memory via model = model.to("cpu"); torch.cuda.empty_cache(), but this didn’t change the memory behavior either.
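Concretely, the in-between step sits in the loop roughly like this (sketch, Trainer arguments omitted):

for data in datasets:
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

    # try to free GPU memory before the next run
    model = model.to("cpu")
    torch.cuda.empty_cache()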

Deleting the model and reloading it from disk also does not seem to help:

del model
torch.cuda.empty_cache()
model = ... # reload

I also tried re-using the same Trainer instance, just changing its train_dataset attribute between runs, with the same OOM result.
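For reference, that attempt looked roughly like this (sketch):

trainer = Trainer(..., model=model, train_dataset=datasets[0])
for data in datasets:
    trainer.train_dataset = data  # swap the dataset on the existing Trainer
    trainer.train()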

My last, somewhat hacky attempt was to run each training run in its own subprocess, so that its resources are freed when the process terminates, and then to start the next run in a fresh process. However, with this approach the training loop gets stuck, presumably because of a deadlock.
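The subprocess variant was roughly along these lines (a sketch; the train_one helper is just how I structured it for illustration):

import multiprocessing as mp

def train_one(data):
    # load the model inside the child so no CUDA state is shared with the parent
    model = ...  # reload from disk
    trainer = Trainer(..., model=model, train_dataset=data)
    trainer.train()

if __name__ == "__main__":
    # "spawn" rather than "fork", so the child does not inherit CUDA state
    ctx = mp.get_context("spawn")
    for data in datasets:
        p = ctx.Process(target=train_one, args=(data,))
        p.start()
        p.join()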

Additional information

I’m running on a single A100 with 80 GB of memory, using batch size 1 and DeepSpeed ZeRO stage 2 with CPU offloading.
All datasets have exactly the same size and sequence length.
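For completeness, the DeepSpeed/Trainer setup is roughly equivalent to the following (paraphrased; the exact config values may differ slightly):

from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # ZeRO stage 2 with optimizer CPU offload
)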

Questions

Has anyone encountered similar issues before or managed to make repeated training runs work?
Could the Trainer or the model be leaking memory?
Could the problem be DeepSpeed-related?
Thank you!