Hi, thanks for your interest in HuggingFace and my notebook! All of this should work out of the box in a distributed environment: the Trainer instantiates the model and creates the dataloaders internally. For instance, you can control the number of workers it uses when creating the dataloaders by setting the `dataloader_num_workers` argument of `TrainingArguments`.
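Here is a minimal sketch of where that argument goes (the model and dataset picked here are just placeholders for illustration, not something from the original notebook):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Small model/dataset chosen only to make the example self-contained.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    dataloader_num_workers=4,  # workers for each DataLoader the Trainer builds
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()
```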
You just need to use the PyTorch launcher to properly launch a multi-GPU, multi-node training. For example:
```bash
python -m torch.distributed.launch --nproc_per_node 8 \
    --nnodes 2 \
    --node_rank rank_of_your_machine \
    --master_addr main_machine_ip \
    --master_port open_port_on_main_machine \
    run_mlm.py \
    --sharded_ddp \
    --all_other_args_to_script
```
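One usage note: this command has to be run once on each of the two machines, with `--node_rank 0` on the main machine (the one whose IP you pass as `--master_addr`) and `--node_rank 1` on the other; the chosen `--master_port` must be open and reachable from both nodes.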