Move Trainer out of GPU

Hello! I’m using the Trainer API, which is great, to train a causal language model (GPT) on 2 GPUs in parallel.

Now I have to train multiple similar models for a few epochs at a time, and I would like to iteratively swap the current Trainer out of the GPUs and swap the next one in. After a while the cycle restarts, so I need to maintain the state of all the Trainers at all times.
I would like to achieve something like:

while not done:
  for trainer in trainer_list:
    for epoch in range(epochs):
      ...  # train this trainer for one epoch, then swap it out for the next

I would also like to keep using both GPUs to train each model.

Does anyone know how I could move a (multi-GPU) Trainer out of VRAM?
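For reference, here is roughly what I have in mind in plain PyTorch terms. This is just a sketch: `swap_in`/`swap_out` and the model list are placeholders I made up, not Trainer API calls, and the training step is elided.

```python
import torch
import torch.nn as nn

def swap_out(model: nn.Module) -> None:
    """Move a model's parameters to host memory and release cached VRAM."""
    model.to("cpu")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

def swap_in(model: nn.Module, device: torch.device) -> None:
    """Move a model's parameters onto the training device."""
    model.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the models wrapped by each Trainer in trainer_list.
models = [nn.Linear(8, 8) for _ in range(3)]

for model in models:
    swap_in(model, device)
    # ... train for a few epochs here (e.g. trainer.train()) ...
    swap_out(model)

# After one cycle, every model is back in host memory, while the
# Trainer objects (and their state) persist between cycles.
print(all(p.device.type == "cpu" for m in models for p in m.parameters()))
```

The open part for me is whether the Trainer keeps other references on the GPU (optimizer state, the wrapped DataParallel/DDP model) that a plain `model.to("cpu")` would not release.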

edit: typos