run_ner.py slower on multi-GPU than on a single GPU

Am I missing something? Why is run_ner.py in transformers 3.0.2 so much slower when running on 2 GPUs vs. a single GPU?

(1) 7 minutes: fp16, 1 GPU
(2) 13 minutes: fp16, 2 GPUs (no launcher)
(3) 11 minutes: fp16, 2 GPUs, python -m torch.distributed.launch --nproc_per_node 2 run_ner.py
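
To make the three setups concrete, they correspond to launch commands along these lines (model/data arguments omitted; the exact invocation for (1) and (2) is not shown in the logs, so the single-GPU pinning via CUDA_VISIBLE_DEVICES is an assumption):

(1): CUDA_VISIBLE_DEVICES=0 python run_ner.py ...
(2): python run_ner.py ...  (both GPUs visible, no launcher)
(3): python -m torch.distributed.launch --nproc_per_node 2 run_ner.py ...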

(1)

07/13/2020 13:21:49 - INFO - transformers.training_args - PyTorch: setting up devices
07/13/2020 13:21:50 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
07/13/2020 13:21:50 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/opt/ml/model/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=6, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4, max_steps=-1, warmup_steps=0, logging_dir='/opt/ml/model/log', logging_first_step=False, logging_steps=500, save_steps=750, save_total_limit=None, no_cuda=False, seed=1, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
07/13/2020 13:21:50 - INFO - transformers.configuration_utils - loading configuration file /opt/program/models/bert-base-multilingual-cased/config.json

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)…
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
07/13/2020 13:21:58 - INFO - transformers.trainer - ***** Running training *****
07/13/2020 13:21:58 - INFO - transformers.trainer - Num examples = 7239
07/13/2020 13:21:58 - INFO - transformers.trainer - Num Epochs = 4
07/13/2020 13:21:58 - INFO - transformers.trainer - Instantaneous batch size per device = 6
07/13/2020 13:21:58 - INFO - transformers.trainer - Total train batch size (w. parallel, distributed & accumulation) = 6
07/13/2020 13:21:58 - INFO - transformers.trainer - Gradient Accumulation steps = 1
07/13/2020 13:21:58 - INFO - transformers.trainer - Total optimization steps = 4828
07/13/2020 13:21:58 - INFO - transformers.trainer - Starting fine-tuning.

Epoch: 100%|██████████| 4/4 [07:13<00:00, 108.33s/it]
07/13/2020 13:29:12 - INFO - transformers.trainer -

Training completed. Do not forget to share your model on huggingface.co/models =)

(2)

07/13/2020 15:21:33 - INFO - transformers.training_args - PyTorch: setting up devices
07/13/2020 15:21:33 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 2, distributed training: False, 16-bits training: True
07/13/2020 15:21:33 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/opt/ml/model/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=6, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4, max_steps=-1, warmup_steps=0, logging_dir='/opt/ml/model/log', logging_first_step=False, logging_steps=500, save_steps=750, save_total_limit=None, no_cuda=False, seed=1, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
07/13/2020 15:21:33 - INFO - transformers.configuration_utils - loading configuration file /opt/program/models/bert-base-multilingual-cased/config.json

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)…
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
07/13/2020 15:21:57 - INFO - transformers.trainer - ***** Running training *****
07/13/2020 15:21:57 - INFO - transformers.trainer - Num examples = 7239
07/13/2020 15:21:57 - INFO - transformers.trainer - Num Epochs = 4
07/13/2020 15:21:57 - INFO - transformers.trainer - Instantaneous batch size per device = 6
07/13/2020 15:21:57 - INFO - transformers.trainer - Total train batch size (w. parallel, distributed & accumulation) = 12
07/13/2020 15:21:57 - INFO - transformers.trainer - Gradient Accumulation steps = 1
07/13/2020 15:21:57 - INFO - transformers.trainer - Total optimization steps = 2416
07/13/2020 15:21:57 - INFO - transformers.trainer - Starting fine-tuning.

Epoch: 100%|██████████| 4/4 [13:16<00:00, 199.08s/it]
07/13/2020 15:35:14 - INFO - transformers.trainer -

Training completed. Do not forget to share your model…
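
As a sanity check on the numbers above, the drop in total optimization steps from 4828 to 2416 matches the effective batch size doubling from 6 to 12. A rough back-of-the-envelope in Python (the 7239 examples and 4 epochs are taken from the trainer logs):

import math

num_examples = 7239   # "Num examples" from the trainer log
epochs = 4
per_device_bs = 6

steps_1gpu = math.ceil(num_examples / per_device_bs) * epochs        # 1207 * 4 = 4828, run (1)
steps_2gpu = math.ceil(num_examples / (per_device_bs * 2)) * epochs  # 604 * 4 = 2416, runs (2) and (3)
print(steps_1gpu, steps_2gpu)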

(3)

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


07/13/2020 15:47:16 - INFO - transformers.training_args - PyTorch: setting up devices
07/13/2020 15:47:16 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
07/13/2020 15:47:16 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
07/13/2020 15:47:16 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/opt/ml/model/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=6, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4.0, max_steps=-1, warmup_steps=0, logging_dir='/opt/ml/model/log', logging_first_step=False, logging_steps=500, save_steps=750, save_total_limit=None, no_cuda=False, seed=1, fp16=True, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
07/13/2020 15:47:16 - INFO - transformers.configuration_utils - loading configuration file /opt/program/models/bert-base-multilingual-cased/config.json

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)…
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
07/13/2020 15:47:41 - WARNING - transformers.trainer - You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.
07/13/2020 15:47:41 - INFO - transformers.trainer - ***** Running training *****
07/13/2020 15:47:41 - INFO - transformers.trainer - Num examples = 7239
07/13/2020 15:47:41 - INFO - transformers.trainer - Num Epochs = 4
07/13/2020 15:47:41 - INFO - transformers.trainer - Instantaneous batch size per device = 6
07/13/2020 15:47:41 - INFO - transformers.trainer - Total train batch size (w. parallel, distributed & accumulation) = 12
07/13/2020 15:47:41 - INFO - transformers.trainer - Gradient Accumulation steps = 1
07/13/2020 15:47:41 - INFO - transformers.trainer - Total optimization steps = 2416
07/13/2020 15:47:41 - INFO - transformers.trainer - Starting fine-tuning.

Epoch: 100%|██████████| 4/4 [11:13<00:00, 168.32s/it]
07/13/2020 15:58:54 - INFO - transformers.trainer -

Training completed. Do not forget to share your model…

It might be that, since local_rank = -1 in the second setting, only one GPU was still being used.
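
For what it's worth, my simplified understanding of how the model gets wrapped depending on local_rank and the number of visible GPUs is sketched below; this is my own reconstruction, not the actual Trainer source:

import torch
from torch.nn.parallel import DataParallel, DistributedDataParallel

def wrap_model(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    # Simplified reconstruction of the wrapping decision (not the real Trainer code):
    # local_rank >= 0 (set by torch.distributed.launch) -> DistributedDataParallel;
    # local_rank == -1 with more than one visible GPU   -> torch.nn.DataParallel.
    if local_rank != -1:
        # Assumes torch.distributed.init_process_group() has already been called.
        return DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    if torch.cuda.device_count() > 1:
        return DataParallel(model)
    return model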