How to run an end-to-end example of distributed data parallel with Hugging Face's Trainer API (ideally on a single node with multiple GPUs)?
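
For context, here is roughly the kind of script I have in mind, a minimal sketch of Trainer on a single node with multiple GPUs. The model name, dataset, and output paths are just placeholders I picked for the example, not anything specific to my setup:

```python
# train.py -- minimal single-node multi-GPU sketch.
# Launch with: torchrun --nproc_per_node=2 train.py
# (Trainer picks up LOCAL_RANK etc. from the environment set by torchrun.)
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Small public dataset just so the example runs end to end.
raw = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    )

tokenized = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",                 # placeholder output directory
    per_device_train_batch_size=8,
    num_train_epochs=1,
    save_steps=500,                   # checkpoints written during training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].select(range(2000)),  # subset to keep it quick
)
trainer.train()
```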

Hi @muellerzr, I'm curious about how the Trainer works. After looking at the script, I found that when saving the model at a checkpoint, it doesn't use the local_rank argument to make the model save only on the first worker. But the example from PyTorch here shows the checkpoint save guarded by local_rank. Is it okay to do what the Trainer does?
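
To make concrete what I mean, here is roughly the guard I expected to see, following the pattern in the PyTorch tutorial. `model` and `trainer` stand in for the objects from the script above, the file paths are placeholders, and this is not the Trainer's actual internal code:

```python
import os
import torch

# Pattern from the PyTorch DDP example: only the first worker writes the file,
# so the processes don't all save the same checkpoint at once.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if local_rank == 0:
    torch.save(model.state_dict(), "checkpoint.pt")  # placeholder path

# The guard I would have expected to need with the Trainer objects above,
# if saving had to be done manually. is_world_process_zero() is the Trainer's
# own rank-0 check.
if trainer.is_world_process_zero():
    trainer.save_model("out/manual-checkpoint")  # placeholder directory
```

Is something like this necessary, or does the Trainer already take care of it internally when it writes checkpoints?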