LM example run_clm.py isn't distributing data across multiple GPUs as expected

EDIT: I think the missing piece is the -m torch.distributed.launch flag from the terminal command. I will test this when I get a chance and update the thread if that’s the fix.

I am fine-tuning GPT-2 using examples/language-modeling/run_clm.py. As I understand it, the Trainer instantiated in that script should by default wrap the model in DistributedDataParallel and spread training across the 4 GPUs I expose by setting CUDA_VISIBLE_DEVICES=0,1,2,3 at call time.

However, when I run nvidia-smi only gpu:0 is being used. The first line the script prints is

01/16/2021 02:39:40 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 0 distributed training: False, 16-bits training: False

My understanding is that with DDP, n_gpu should be 1? I think the printed n_gpu: 0 is just a result of n_gpu not actually being available as a configuration flag.

The next line I get is

01/16/2021 02:39:40 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/data/saxon/nlp_abs_full/test, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=5, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan16_02-39-40_andrew.cs, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=3, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/data/saxon/nlp_abs_full/test, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, _n_gpu=4)

It looks like somewhere during trainer setup n_gpu defaults to 4 (the number of GPUs I have), which I suspect is interfering with the step where the Trainer is supposed to wrap the model in DDP.

To try to fix this, I added the line trainer._n_gpu = 1 to force the value to what it should be for DDP according to the documentation.

However, this does not fix the problem, so I’m stuck. I think this might be a bug in the example script, because the expected behavior, if I understand right, is that training should automatically use DDP when more than 1 gpu is available.

Am I doing something wrong? The full command I’m running is

TRANSFORMERS_CACHE=/data/saxon/cache CUDA_VISIBLE_DEVICES=0,1,2,3 python run_clm.py --model_name_or_path gpt2 --train_file /data/saxon/nlp_abs_full/ffl.txt --do_train --output_dir /data/saxon/nlp_abs_full/test --per_device_train_batch_size 5 --cache_dir /data/saxon/cache --save_total_limit 3 --save_steps 500 --overwrite_output_dir

If you want to use DDP (distributed data parallel) you do need to launch the script with python -m torch.distributed.launch.
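
For the command in the question, the launch would look roughly like the sketch below. This is a minimal illustration, not an official recipe: build_ddp_command is a hypothetical helper written for this post, and it only assembles and prints the command line (it does not start training). The --nproc_per_node value of 4 assumes one process per GPU.

```python
import shlex


def build_ddp_command(nproc_per_node, script, script_args):
    """Hypothetical helper: assemble a torch.distributed.launch command line."""
    launcher = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU (assumption)
    ]
    return launcher + [script] + list(script_args)


# Script arguments copied from the command in the original question.
script_args = [
    "--model_name_or_path", "gpt2",
    "--train_file", "/data/saxon/nlp_abs_full/ffl.txt",
    "--do_train",
    "--output_dir", "/data/saxon/nlp_abs_full/test",
    "--per_device_train_batch_size", "5",
    "--cache_dir", "/data/saxon/cache",
    "--save_total_limit", "3",
    "--save_steps", "500",
    "--overwrite_output_dir",
]

cmd = build_ddp_command(4, "run_clm.py", script_args)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The environment variables from the original command (TRANSFORMERS_CACHE, CUDA_VISIBLE_DEVICES) would still go in front of the printed command.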

@sgugger Just to clarify one thing: when launching the script with python -m torch.distributed.launch --nproc_per_node=8 script.py, training will use DDP and there is no need to set n_gpu, correct? I was confused because it logged a warning saying that n_gpu is set to 1 per node.

There is no n_gpu argument you can set, so I’m confused about your question.

This line reported n_gpu as 1 while I was using DDP with 8 GPUs, so I was confused.
Since, as you said, there is no n_gpu argument to set, this warning is probably normal.
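
A logged n_gpu of 1 under DDP is consistent with each launched process driving a single GPU. The sketch below is a simplified model of that decision as I understand it. resolve_devices is a hypothetical function written for this post, not the actual code in TrainingArguments, and it ignores TPUs and other backends.

```python
def resolve_devices(local_rank, visible_gpus, no_cuda=False):
    """Hypothetical, simplified model of how device and n_gpu get chosen.

    Returns a (device, n_gpu, distributed) tuple.
    """
    if no_cuda or visible_gpus == 0:
        # No GPU available or CUDA disabled: run on CPU.
        return "cpu", 0, False
    if local_rank == -1:
        # Plain `python script.py`: a single process sees every visible GPU.
        return "cuda:0", visible_gpus, False
    # Started by the distributed launcher: each process owns exactly one GPU.
    return f"cuda:{local_rank}", 1, True


# The run from the original question: no launcher, 4 visible GPUs.
print(resolve_devices(local_rank=-1, visible_gpus=4))

# One of the 8 processes started with --nproc_per_node=8.
print(resolve_devices(local_rank=3, visible_gpus=8))
```

In the first case local_rank stays at -1, which matches the local_rank=-1 and _n_gpu=4 values in the logged TrainingArguments above. As far as I know, the Trainer then falls back to torch.nn.DataParallel rather than DDP in that situation.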

does this solve your question: Using Transformers with DistributedDataParallel — any examples?