Hello! I’m using the Trainer API, which is great, to train a causal language model (GPT) on 2 GPUs in parallel.
Now I have to train several similar models, a few epochs at a time, and I would like to iteratively swap the current Trainer out of the GPUs and swap the next one in. The cycle eventually restarts after a while, so I need to preserve the state of all the Trainers at all times.
I would like to achieve something like:
```python
while not done:
    for trainer in trainer_list:
        move_to_cuda(trainer)
        for epoch in range(epochs):
            trainer.train()
        move_out_of_cuda(trainer)
```
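For context, here is a minimal sketch of what I imagine `move_out_of_cuda` / `move_to_cuda` doing for a plain model (both names are just my placeholders, not Trainer methods). It moves the weights to CPU and releases the cached VRAM with `torch.cuda.empty_cache()`; I assume a real version would also have to handle the optimizer state held by `trainer.optimizer` and whatever multi-GPU wrapper (e.g. `DataParallel`) the Trainer has put around `trainer.model`:

```python
import torch
import torch.nn as nn

def move_to_cuda(model: nn.Module) -> None:
    # Falls back to CPU when no GPU is present, so the sketch runs anywhere.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

def move_out_of_cuda(model: nn.Module) -> None:
    # Move the parameters back to host memory...
    model.to("cpu")
    # ...and ask the CUDA caching allocator to release the freed blocks.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

model = nn.Linear(8, 8)
move_to_cuda(model)
move_out_of_cuda(model)
print(next(model.parameters()).device.type)
```

This handles the raw weights, but I don’t know whether it is enough for a full Trainer, which is the core of my question.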
I would also like to keep using both GPUs to train each model.
Does anyone know how I could move a (multi-GPU) Trainer out of VRAM?