Hi, I am trying to train XLNet and it requires around 14 GB of memory, but I only have access to 12 GB GPU nodes. However, when I try to train the model on two nodes (that is, 24 GB in total), the Trainer returns a CUDA out of memory error. Can you help me overcome this error?
Usually, training on two GPUs is there to help you get a bigger batch size: what the Trainer and the example scripts do automatically is that each GPU processes a batch of the given --per_device_train_batch_size, which results in an effective batch size of 2 * per_device_train_batch_size. This still requires the full model to fit on each GPU.
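To make that concrete, here is a minimal sketch of that data-parallel setup (the output_dir and batch size are just placeholder values):

```python
from transformers import TrainingArguments

# Sketch only: each of the 2 GPUs runs its own copy of the model
# and processes its own batch of this size.
args = TrainingArguments(
    output_dir="out",                 # placeholder path
    per_device_train_batch_size=8,    # batch handled by *each* GPU
)

# With 2 GPUs the effective training batch size is 2 * 8 = 16,
# but every GPU still has to hold the full model, optimizer states
# and gradients, which is why a ~14 GB model OOMs on 12 GB cards.
n_gpus = 2
effective_batch_size = n_gpus * args.per_device_train_batch_size
print(effective_batch_size)  # 16
```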
What you want to use is model parallelism, but this is still very experimental. You can check the DeepSpeed or FairScale integrations to use ZeRO stage 3 and split your model across your two GPUs.
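If you go the DeepSpeed route, a minimal setup could look roughly like this (just a sketch: the config values, output_dir and batch size are placeholders, and the full set of options is in the DeepSpeed integration docs):

```python
from transformers import TrainingArguments

# Minimal ZeRO stage 3 config: model states are sharded across the GPUs
# instead of being replicated on each one. "auto" lets the Trainer fill
# in the values from the TrainingArguments.
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",                 # placeholder path
    per_device_train_batch_size=4,
    deepspeed=ds_config,              # enable the DeepSpeed integration
)

# Build your Trainer as usual with these arguments, e.g.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# and launch the script with the deepspeed launcher (one process per GPU).
```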