Training using multiple GPUs

@sgugger I am trying to test multi-GPU training with the HF Trainer, but for a third-party PyTorch model rather than one of the built-in architectures. I have already overridden compute_loss, and Trainer.train() runs without a problem on single-GPU machines. On a 4-GPU EC2 machine I get the following error:
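
For reference, the override looks roughly like the sketch below (the model name and its call signature, e.g. the input_ids argument, are placeholders for the third-party encoder, not its real API; only return_loss=True is taken from my actual code):

```python
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs):
        # The third-party encoder computes and returns its own MLM loss
        # when called with return_loss=True (placeholder call signature).
        loss = model(
            inputs["input_ids"],
            return_loss=True,
        )
        return loss
```

custom_trainer is an instance of this class, and custom_trainer.train() is called exactly the same way as in the single-GPU runs.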

```
  0%|          | 0/20000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train_hf_mlm_encoder_single_gpu.py", line 222, in <module>
    main(params_dict)
  File "train_hf_mlm_encoder_single_gpu.py", line 218, in main
    custom_trainer.train()
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/transformers/trainer.py", line 1053, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/transformers/trainer.py", line 1443, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/a204311-DataScientist/projects/trlabs_routing_transformer/routing_sum/mlm_pretrain/train_and_eval.py", line 121, in compute_loss
    return_loss=True
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 104, in replicate
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 68, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/home/a204311-DataScientist/anaconda3/envs/routing/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: Unconvertible NCCL type
  0%|
```

Any hints as to what may be causing this? I was under the impression that multi-GPU training works out of the box with the Hugging Face Trainer. Thank you for your help.