How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?

OK, I must admit it didn't occur to me to just run my normal script by prepending torchrun --nproc_per_node 2 ... or python -m torch.distributed.launch --nproc_per_node 2 main_data_parallel_ddp_pg.py (the older launcher, which is more familiar to me).
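For what it's worth, here is a minimal sketch of a script that can be launched that way. The file name matches the one above, but the model, dataset, and training arguments are my own assumptions for a quick smoke test; the point is that the script contains no DDP-specific code, since the Trainer picks up the rank and world-size environment variables that torchrun sets:

```python
# main_data_parallel_ddp_pg.py
# launch with: torchrun --nproc_per_node 2 main_data_parallel_ddp_pg.py
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # assumption: any sequence-classification checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# a tiny IMDB slice, just enough to watch both GPUs light up in nvidia-smi
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="ddp_out", per_device_train_batch_size=8,
                         num_train_epochs=1, logging_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```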

I will launch it as you suggested and track GPU usage with nvidia-smi.

I assume the processes are somehow in communication and coordinate so that they know when an epoch has truly ended.
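As far as I understand, that coordination is mostly implicit: each process gets a disjoint, equal-sized shard of the data from a DistributedSampler (the Trainer sets this up for you), so every rank runs out of batches at the same step, and the gradient all-reduce in each backward pass keeps them in lockstep; no explicit end-of-epoch handshake is needed. A bare-PyTorch sketch of the sharding part (the file name sampler_demo.py is just for illustration):

```python
# sampler_demo.py
# run with: torchrun --nproc_per_node 2 sampler_demo.py
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("gloo")  # "nccl" on GPU; gloo keeps the demo CPU-only
rank = dist.get_rank()

dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset)  # each rank gets a disjoint, equal-sized shard
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
    n_batches = sum(1 for _ in loader)
    # every rank sees the same batch count, so all ranks agree the epoch is over
    print(f"rank {rank}: epoch {epoch} done after {n_batches} batches")

dist.destroy_process_group()
```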

Thanks, that was useful.


FYI, in case torchrun doesn't work for you:

# pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

or, if you want to upgrade an existing install:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade
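Whichever variant you use, a quick sanity check that the install actually sees your GPUs (plain torch calls, nothing specific to this thread):

```python
import torch

print(torch.__version__)          # should end in +cu113 for this install
print(torch.cuda.is_available())  # True if the CUDA build can see a driver
print(torch.cuda.device_count())  # should match the GPU count in nvidia-smi
```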