How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?

In Accelerate we report only on the main process, as shown here: transformers/examples/pytorch/text-classification/run_glue_no_trainer.py at main · huggingface/transformers · GitHub
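Roughly, the pattern on the Accelerate side is something like the following (a minimal sketch, not the full example script; the actual wandb calls are up to you):

from accelerate import Accelerator

accelerator = Accelerator()

if accelerator.is_main_process:
    # main process across all machines: init trackers, log metrics, save final artifacts, ...
    ...

if accelerator.is_local_main_process:
    # main process on this particular machine: e.g. progress bars
    ...

accelerator.print("printed once, from the main process only")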

For non-Accelerate code, I think it’d look something like:

import os
if int(os.environ.get("LOCAL_RANK", -1)) == 0:
    # do any wandb setup/logging on this node's main process only
    ...

and do that check every single time you call into wandb.

Note that this is specific to torch.distributed.run/launch, so it won’t work this way on TPUs. The key point is that you need to check that you are on the main process of the local/main node.
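If you are going through the Trainer API itself, it also exposes helpers that wrap these rank checks (and should cover the TPU case too). A small sketch, assuming `trainer` is an already-built transformers.Trainer:

if trainer.is_world_process_zero():
    # global main process: a safe place for wandb logging, saving final artifacts, ...
    ...

if trainer.is_local_process_zero():
    # main process on this node: e.g. per-node progress bars
    ...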

Since you’re clearly interested in inner workings like this, I’d recommend giving Accelerate a peek if possible. It sounds like you may like what we’ve done there :slight_smile:

(Though transformers does not use Accelerate internally right now for this type of stuff)
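To come back to the original question: an end-to-end single-node multi-GPU run with the Trainer is mostly just a regular training script launched with torchrun; the Trainer reads LOCAL_RANK itself and wraps the model in DistributedDataParallel. A rough sketch (the model, dataset, and hyperparameters below are only illustrative placeholders):

# hypothetical file: run_ddp_example.py
# launch on one node with e.g.: torchrun --nproc_per_node=4 run_ddp_example.py
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

raw = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # MRPC is a sentence-pair task, so tokenize both sentences together
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = raw.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

args = TrainingArguments(
    output_dir="mrpc-ddp",
    per_device_train_batch_size=16,   # per GPU, so 4 GPUs -> effective batch size 64
    num_train_epochs=1,
    report_to="none",                 # or "wandb"; the Trainer only reports from the main process
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,              # also gives you a padding data collator by default
)

trainer.train()

Each process launched by torchrun runs this same script on its own GPU; the Trainer takes care of the distributed sampler and gradient synchronization for you.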
