Longformer on 1 GPU or multi-GPU


Sorry if this duplicates an existing question. I did a brief search of the forum but did not really find the
I decided to fine-tune Longformer on a dataset of 3000 pairs. Each input is up to 4096 tokens long. After some simple calculations I worked out that running with batch size 1 needs around 24 GB of GPU memory (HBM). I do not have such a GPU, so I turned to my old 2-socket, 20-core Xeon with 64 GB of RAM.
I installed PyTorch optimized with MKL-DNN for Intel processors… and after running it I realized that fine-tuning on 3000 pairs would take around 100 hours. 100 hours, Carl! Either this Xeon is too old (it supports only AVX) or MKL-DNN does not optimize BERT-like PyTorch models.

Anyway, I’m now looking into renting a GPU server, which finally brings me to my questions.

Assuming that I need 24 GB of memory for a batch of 1, can I rent a server with 2 GPUs of 16 GB each? Do you know whether PyTorch + CUDA can split the work across 2 GPUs even for batch size 1 without degradation?
Or do I need to look for a single NVIDIA V100 with 32 GB of HBM to solve this problem?

Has anybody already tried Longformer and can share some performance results, with details of the hardware used?


I faced the same problem. For now I am using a 32 GB GPU (a p3dn.24xlarge EC2 instance), and I also reduced the number of tokens and the batch size. At present Longformer doesn’t support multiple GPUs. We could shard the training data and train on it iteratively, saving the model between shards (still exploring this). We could also explore the checkpoint feature in PyTorch (https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html). Let me know if you have already found an efficient way to do this.
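The sharding idea could be sketched roughly like this. It is only a sketch under assumptions: the file name shard_state.json, the shard size of 500, and all helper names are made up for illustration, and train_on_shard is a placeholder where the actual fine-tuning and the saving of model and optimizer state (for example with torch.save, as in the linked recipe) would go.

```python
import json
from pathlib import Path


def make_shards(num_examples, shard_size):
    """Split example indices [0, num_examples) into consecutive shards."""
    return [
        range(start, min(start + shard_size, num_examples))
        for start in range(0, num_examples, shard_size)
    ]


def load_next_shard(state_file):
    """Return the index of the first shard that has not been trained yet."""
    if state_file.exists():
        return json.loads(state_file.read_text())["next_shard"]
    return 0


def save_next_shard(state_file, next_shard):
    """Record progress so an interrupted run can resume at the right shard."""
    state_file.write_text(json.dumps({"next_shard": next_shard}))


def train_on_shard(indices):
    """Placeholder (hypothetical): fine-tune on these examples, then save the
    model and optimizer state before returning."""


def run(num_examples, shard_size, state_file):
    """Train shard by shard, skipping shards finished in an earlier run."""
    shards = make_shards(num_examples, shard_size)
    for shard_id in range(load_next_shard(state_file), len(shards)):
        train_on_shard(shards[shard_id])
        save_next_shard(state_file, shard_id + 1)
    return len(shards)


if __name__ == "__main__":
    # 3000 pairs as in the original question; 500 per shard is an arbitrary choice.
    run(num_examples=3000, shard_size=500, state_file=Path("shard_state.json"))
```

The state file only tracks which shard comes next, so a run that is interrupted resumes at the first unfinished shard. The model weights themselves would still have to be reloaded from whatever checkpoint train_on_shard wrote.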

Have you noticed any progress on this front, i.e. how to enable multiple GPUs for Longformer fine-tuning? Even though I have 4 GPUs available, training only uses a single GPU.

I’ve got it to work with Longformer. Are you using Trainer or Accelerate?

I am using Trainer

What command did you use to launch the training?

I am just calling trainer.train() inside the notebook. n_gpus=4 is set in the training arguments, but it only uses the first GPU to train.
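From inside the notebook, one thing worth checking is whether the CUDA_VISIBLE_DEVICES environment variable is limiting the process to one GPU. Below is a minimal sketch using only the standard library. The helper name visible_gpu_ids is made up for illustration, and it only reads the variable rather than querying the driver.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: no restriction is placed on the process
    # An empty string yields an empty list, i.e. no GPUs are exposed.
    return [part.strip() for part in raw.split(",") if part.strip()]


ids = visible_gpu_ids()
if ids is None:
    print("CUDA_VISIBLE_DEVICES is not set; no restriction on visible GPUs")
else:
    print(f"CUDA_VISIBLE_DEVICES limits the process to {len(ids)} device(s): {ids}")
```

With PyTorch installed, torch.cuda.device_count() reports how many GPUs the process actually sees, which is the number that matters for training.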

  1. You shouldn’t have to specify n_gpus; the Trainer automatically uses all available devices (unless the environment variable CUDA_VISIBLE_DEVICES restricts it to a single device).
  2. Try launching it through a script using this:
python -m torch.distributed.launch \
    --nproc_per_node number_of_gpu_you_have path_to_script.py \