LM example run_clm.py isn't distributing data across multiple GPUs as expected

saxon · January 16, 2021, 10:38pm

EDIT: I think the missing piece is the -m torch.distributed.launch flag from the terminal command. I will test this when I get a chance and update the thread if that’s the fix.

I am fine-tuning GPT-2 using examples/language-modeling/run_clm.py. My understanding is that the Trainer class it instantiates will by default wrap the model in Distributed Data Parallel and spread it across the 4 GPUs I make available by setting CUDA_VISIBLE_DEVICES=0,1,2,3 at call time.

However, when I run nvidia-smi only gpu:0 is being used. The first line the script prints is

01/16/2021 02:39:40 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 0 distributed training: False, 16-bits training: False

I understand that for the model using DDP, n_gpu should be 1? I think the print of n_gpu=0 is just a result of n_gpu not actually being available as a flag for configuration.

The next line I get is

01/16/2021 02:39:40 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/data/saxon/nlp_abs_full/test, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=5, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan16_02-39-40_andrew.cs, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=3, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/data/saxon/nlp_abs_full/test, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, _n_gpu=4)

It seems that somewhere during trainer setup, n_gpu is defaulting to 4 (the number of GPUs I have), which I think is somehow interfering with the step where the trainer is supposed to wrap the model in DDP.

To try to fix this, I added the line trainer._n_gpu = 1 to force the value to what the documentation says it should be for DDP.

However, this does not fix the problem, so I’m stuck. I think this might be a bug in the example script, because the expected behavior, if I understand right, is that training should automatically use DDP when more than 1 gpu is available.

Am I doing something wrong? The full command I’m running is

TRANSFORMERS_CACHE=/data/saxon/cache CUDA_VISIBLE_DEVICES=0,1,2,3 python run_clm.py --model_name_or_path gpt2 --train_file /data/saxon/nlp_abs_full/ffl.txt --do_train --output_dir /data/saxon/nlp_abs_full/test --per_device_train_batch_size 5 --cache_dir /data/saxon/cache --save_total_limit 3 --save_steps 500 --overwrite_output_dir

sgugger · January 19, 2021, 9:10pm

If you want to use DDP (distributed data parallel) you do need to launch the script with python -m torch.distributed.launch.
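As a rough illustration of what that changes in the original command, here is a small Python sketch that assembles the launcher invocation. The helper name build_ddp_command is made up for this example, and nproc_per_node=4 is an assumption based on the 4 GPUs mentioned in the first post; the script arguments are copied from that post.

```python
import shlex
import sys


def build_ddp_command(script, script_args, nproc_per_node):
    """Return the argv list that starts `script` through torch.distributed.launch.

    Hypothetical helper for illustration: the launcher spawns `nproc_per_node`
    worker processes, one per GPU.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script, *script_args,
    ]


# Script arguments taken from the command in the first post.
script_args = shlex.split(
    "--model_name_or_path gpt2 "
    "--train_file /data/saxon/nlp_abs_full/ffl.txt "
    "--do_train --output_dir /data/saxon/nlp_abs_full/test "
    "--per_device_train_batch_size 5 --cache_dir /data/saxon/cache "
    "--save_total_limit 3 --save_steps 500 --overwrite_output_dir"
)

# Assumption: one process for each of the 4 visible GPUs.
cmd = build_ddp_command("run_clm.py", script_args, nproc_per_node=4)

# Print the command so it can be checked before running it.
print(shlex.join(cmd))
```

The printed line can be pasted into the shell after the same TRANSFORMERS_CACHE and CUDA_VISIBLE_DEVICES prefixes, or the list can be passed to subprocess.run(cmd, check=True).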

sadra · March 3, 2022, 11:52pm

@sgugger Just to clarify one thing: when launching a script with python -m torch.distributed.launch --nproc_per_node=8 script.py, it will be DDP training and there is no need to set n_gpu? I found this confusing because a warning was logged saying n_gpu was being set to 1 per node.

sgugger · March 4, 2022, 2:27pm

There is no n_gpu argument you can set, so I’m confused about your question.

sadra · March 10, 2022, 8:37pm

This line reported n_gpu as 1 while I am using DDP with 8 GPUs, so I was confused.
As you said, there is no n_gpu to set, so this warning is probably normal.
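For anyone else puzzled by this: n_gpu being 1 is consistent with DDP, because each launched process drives a single GPU and the parallelism comes from the number of processes. Below is a minimal sketch of the batch-size arithmetic as I understand the Trainer to do it. The function name is hypothetical, and the numbers reuse per_device_train_batch_size=5 from the first post.

```python
def total_train_batch_size(per_device_batch_size, n_gpu, world_size,
                           gradient_accumulation_steps=1):
    """Samples consumed per optimizer step across all processes.

    n_gpu      : GPUs driven by ONE process (1 under DDP).
    world_size : number of processes (1 when not launched distributed).
    """
    per_process = per_device_batch_size * max(1, n_gpu)
    return per_process * gradient_accumulation_steps * world_size


# DDP on 8 GPUs: 8 processes, each seeing n_gpu == 1.
ddp_8_gpus = total_train_batch_size(5, n_gpu=1, world_size=8)

# Plain `python run_clm.py` with 4 visible GPUs: one process, n_gpu == 4.
single_process_4_gpus = total_train_batch_size(5, n_gpu=4, world_size=1)

print(ddp_8_gpus, single_process_4_gpus)  # 40 20
```

So a per-process n_gpu of 1 does not mean only one GPU is training; the other GPUs show up through world_size.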

brando · August 17, 2022, 3:03pm

does this solve your question: Using Transformers with DistributedDataParallel — any examples?

SUNM · May 17, 2023, 5:18am

Hi @saxon, I hope you are well. Sorry, I want to use the model for fine-tuning GPT-2. My question is how you send the data to the model: what is ffl.txt and how is it organized? Is it the whole dataset you used, I mean both training and validation? My dataset is two CSV files that contain training and validation sentences. I would appreciate it if you could help me understand how to pass the data. Many thanks.

saxon · May 17, 2023, 6:30am

It’s been a few years since I did this project, but I believe ffl.txt was just a plaintext file containing natural text that I was fine-tuning GPT2 to generate. This is just the training text and I believe it can be formatted arbitrarily. There might be flags for setting test data, etc. Try checking the documentation for run_clm.py and looking at the options: transformers/run_clm.py at main · huggingface/transformers · GitHub
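If the data starts out as CSV, one way to get a plaintext file like that is to dump the sentence column to a .txt file, one sample per line. This is only a sketch: the column name "sentence" and the file names are assumptions for illustration, not something from my original project.

```python
import csv
import tempfile
from pathlib import Path


def csv_column_to_text(csv_path, txt_path, column="sentence"):
    """Write one CSV column to a plain-text file, one sample per line.

    `column` is a hypothetical column name; adjust it to the real header.
    Returns the number of samples written.
    """
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = (row.get(column) or "").strip()
            if text:
                # Collapse internal line breaks so each sample stays on one line.
                dst.write(" ".join(text.split()) + "\n")
                count += 1
    return count


# Tiny self-contained demo using a temporary directory.
workdir = Path(tempfile.mkdtemp())
demo_csv = workdir / "train.csv"
demo_csv.write_text(
    'sentence\n"The first sample."\n""\n"The second,\nwith a line break"\n',
    encoding="utf-8",
)
demo_txt = workdir / "train.txt"
n_written = csv_column_to_text(demo_csv, demo_txt)
print(n_written, demo_txt.read_text(encoding="utf-8").splitlines())
```

The resulting file could then be passed with --train_file, and a second file built the same way from the validation CSV could go to --validation_file, which run_clm.py also accepts.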

SUNM · May 17, 2023, 6:33am

@saxon, thanks for your answer. Do you mean that it is a txt file containing the sentences, one sample per line? I read the code and it is a bit confusing for me.

SUNM · May 17, 2023, 6:35am

@saxon, can I use it for GPT-Neo as well? I think it is general code.

SUNM · May 17, 2023, 6:48am

Hi @sgugger, I hope you are well. I want to use the run_clm.py code. Is it all right to run it this way? My target model is GPT-Neo. As for the data, is it right to put all the data in a txt file, one sentence per row? Should each sentence end with “.”, or is no “.” needed at the end?

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 2 run_clm.py --model_name_or_path ./nlp/gpt_neo/ --train_file /nlp_abs_full/ffl.txt --do_train --output_dir /data/nlp_abs_full/test --per_device_train_batch_size 5 --save_total_limit 3 --save_steps 500
