How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?

Hi @muellerzr ,

thanks for providing these useful statements. I am using the following setup:

torchrun --nproc_per_node 2 train_xxx.py

which is basically derived from nlp_example.py.

All I actually changed is the tokenize function and the dataset. After starting the script, the model is downloaded and everything starts properly; nvidia-smi shows both GPUs at approx. 80% usage, so far so good.
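
For context, the tokenize step looks roughly like this (a simplified sketch; the checkpoint, column names, file paths, and max length are placeholders rather than the exact values from my script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder checkpoint -- my real script uses its own model and dataset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize_function(examples):
    # Tokenize the raw text column; the truncation length is illustrative.
    return tokenizer(examples["text"], truncation=True, max_length=128)

raw_datasets = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)
```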

What worries me now is that the things I have been logging so far, e.g. the size of the dataset, are printed twice:

2023-03-31 08:14:23.354 | DEBUG    | __main__:get_dataloaders:79 - DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2500
    })
})
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
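
I assume the duplicate lines simply come from both processes logging independently. If so, I could probably restrict the output to the main process, something like this (a sketch based on my understanding of the Accelerate API; the logged message is just an example):

```python
from accelerate import Accelerator
from loguru import logger

accelerator = Accelerator()

# Only emit debug output on the main process so it appears once, not once per GPU.
if accelerator.is_main_process:
    logger.debug("dataset sizes: train=7500, test=2500")

# Alternatively, accelerator.print() is a print() that is silent on non-main processes.
accelerator.print("this line is printed only by the main process")
```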

My question is: how can I make sure that the training actually runs distributed across both GPUs, rather than running independently once on each GPU?
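
To sanity-check this myself, I was thinking of printing the distributed state at the start of the script, something like the sketch below; with two processes I would expect ranks 0 and 1, each bound to its own GPU:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# With torchrun --nproc_per_node 2 I would expect num_processes == 2 and two
# different process_index values (0 and 1), each on its own device.
print(
    f"process_index={accelerator.process_index} "
    f"num_processes={accelerator.num_processes} "
    f"device={accelerator.device} "
    f"distributed_type={accelerator.distributed_type}"
)

# The underlying torch.distributed state should agree with that.
if torch.distributed.is_initialized():
    print(f"rank={torch.distributed.get_rank()} world_size={torch.distributed.get_world_size()}")
```

Is that the right way to verify it, or is there something else I should check?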

Kind regards
Julian