Hi, thanks for your interest in HuggingFace and my notebook! All of this should work out of the box in a distributed environment: the Trainer instantiates the model and creates the dataloaders internally. For instance, you can control the number of workers it uses when creating the dataloaders by setting the `dataloader_num_workers` argument of `TrainingArguments`.
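Here is a minimal sketch of where that argument goes (the model and dataset picked here are just placeholders for illustration, not something from the original notebook):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Small model/dataset chosen only to make the example self-contained.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    dataloader_num_workers=4,  # workers for each DataLoader the Trainer builds
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()
```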
You just need to use the PyTorch launcher to properly launch a multi-GPU, multi-node training. For example:
```bash
python -m torch.distributed.launch --nproc_per_node 8 \
    --nnodes 2 \
    --node_rank rank_of_your_machine \
    --master_addr main_machine_ip \
    --master_port open_port_on_main_machine \
    run_mlm.py \
    --sharded_ddp \
    --all_other_args_to_script
```
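One usage note: this command has to be run once on each of the two machines, with `--node_rank 0` on the main machine (the one whose IP you pass as `--master_addr`) and `--node_rank 1` on the other; the chosen `--master_port` must be open and reachable from both nodes.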