Boilerplate for Trainer using torch.distributed

Hi everyone,

Thanks in advance for helping! First I just have to say, as my first post here, Huggingface is awesome. We have been using the tools/libraries for a while for NLP work and it is just a pleasure to use and so applicable to real-world problems!!!

We are “graduating”, if you will, from single-GPU to multi-GPU models/datasets, and looking through different platforms/libraries to do this. I think overall for our applications we don’t need to customize our training loops - the Huggingface Trainer is our bread and butter (with the exception of ~5% of applications where we write our own PyTorch training loops).

So my question is: for the Huggingface Trainer, is there some boilerplate code that works using torch.distributed? I understand that, at a basic level, torch.distributed launches the same code a bunch of times and you need to know which process instance you are running. My understanding is that the Trainer itself handles all of this for you - but what about when you instantiate the model? When you instantiate your datasets? Etc. What is the bare minimum you need to do to get a Trainer working in a torch.distributed environment? The examples I have found thus far are pretty heavy - they contain a lot of code to parse arguments, etc. (Don’t get me wrong - they are awesome and well documented - again, this is all kind of too good to be true.)

What we’re trying to do is a large ViT to Text model based on this:

Trying to use the Seq2Seq trainer - but using multi-node, multi-GPU for a very large dataset and much higher resolution images.

Any pointers to some simple examples would be much appreciated!!

Hi, thanks for your interest in HuggingFace and my notebook! I assume that all of this should work out-of-the-box in a distributed environment. The Trainer itself instantiates the model and creates dataloaders internally. You can for instance provide the number of workers you want it to use when creating the dataloaders, by specifying the dataloader_num_workers argument in TrainingArguments.

You just need to use the PyTorch launcher to properly launch a multi-GPU, multi-node training. Example:

python -m torch.distributed.launch --nproc_per_node 8 \
    --nnodes 2 \
    --node_rank rank_of_your_machine \
    --master_addr main_machine_ip \
    --master_port open_port_on_main_machine \
    your_training_script.py --sharded_ddp
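As a side note, on recent PyTorch versions `torch.distributed.launch` is deprecated in favor of `torchrun`; assuming the same placeholder values, the equivalent launch would look roughly like:

```shell
# Run on every node, with --node_rank set per machine (0 on the master node).
torchrun --nproc_per_node 8 \
    --nnodes 2 \
    --node_rank rank_of_your_machine \
    --master_addr main_machine_ip \
    --master_port open_port_on_main_machine \
    your_training_script.py
```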

Thank you so much for the quick reply - sorry, I think I phrased it poorly. In the example scripts there are 500+ lines of code. We have our own Python scripts that load some data, pass it to a Huggingface Trainer, etc., very similar to the Seq2SeqTrainer notebook I linked above. These are like 10 lines of code… for one GPU. My question is: for a very basic Huggingface Python script (loads a model, loads a dataset, passes the dataset and model to the Trainer), what is the bare minimum to “convert” this code, which runs on one GPU using the Huggingface Trainer, to multi-GPU using the Huggingface Trainer? Said differently… if you take the script that you linked to, is there a much simpler version of it somewhere that doesn’t have all the argument checking, etc., just the bare minimum to train something on multi-GPU using torch.distributed and a Huggingface Trainer?

Going more into the specifics - here is the bulk of the code I am trying to convert to multi GPU:

trainer = transformers.Seq2SeqTrainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=eval_dataset)


My specific questions - understanding that when I run this using the distributed launcher, it is running 8 of these (for 8 GPUs):

  1. My “model” object is created above that code using from_encoder_decoder_pretrained - if it is running 8 times… is that making 8 copies of the model, or is the Huggingface function from_encoder_decoder_pretrained handling that for me?

  2. My tokenizer is created similarly, using Huggingface’s from_pretrained function - is it making 8 copies, or does it deal with that?

  3. train_dataset & eval_dataset are instances of a very simple dataset class I wrote myself - it gets a row out of a dataframe containing a file name, loads the file, and returns it. I am not worried about 8 of these running in parallel - they are thread-safe because they just read files - but… same question… are there 8 of these running at the same time? Should I care, and try to make only one of them?
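For reference, a dataset like the one described in question 3 might be sketched as follows (the class and attribute names are hypothetical, and a plain list stands in for the dataframe):

```python
# Minimal map-style dataset sketch. Under torch.distributed, each launched
# process constructs its own copy of this object; the Trainer's distributed
# sampler then assigns each copy a different shard of the indices, so the
# copies never return the same rows during one epoch.

class FileDataset:
    """Looks up a file name by row index and returns the file's contents."""

    def __init__(self, rows):
        # `rows` stands in for your dataframe: one file name per example.
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        filename = self.rows[idx]
        with open(filename) as f:  # read-only, so safe across processes
            return f.read()
```

Because `__getitem__` only reads files, having 8 copies alive at once costs almost nothing beyond the memory of the row list itself.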

Hopefully that makes more sense and explains why I am looking for some basic boilerplate. Please, by all means, push back and correct me!!! I have learned a lot!!

The Trainer code will run on distributed or one GPU without any change. Regarding your other questions:

  • you need to define your model in all processes; each process will see a different part of the data, and all model copies will be kept in sync.
  • the tokenizer and dataset preprocessing can either be done on all processes if it doesn’t slow you down, or you can use the with training_args.main_process_first(desc="dataset map pre-processing"): context manager to make sure the preprocessing is done only on process 0.
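To show what that context manager looks like in practice, here is a minimal single-file sketch (the argument values are placeholders, and a list comprehension stands in for your real preprocessing or `.map()` call):

```python
from transformers import TrainingArguments

# Placeholder arguments; a real Seq2Seq script would use Seq2SeqTrainingArguments.
training_args = TrainingArguments(output_dir="./outputs")

# Under torch.distributed, process 0 enters this block first while the other
# ranks wait at a barrier; they then run the block too, typically hitting a
# cache written by rank 0. In a plain single-process run it is a no-op wrapper.
with training_args.main_process_first(desc="dataset map pre-processing"):
    train_examples = [n * n for n in range(4)]  # stand-in for your .map() call
```

Everything else in the single-GPU script (model creation, dataset construction, the Trainer call) stays exactly the same; only the launch command changes.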

That is exactly what I needed to hear - thank you so much for your reply!!!