How do I run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?
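For context, this is roughly the kind of minimal script I have in mind (just a sketch — the model `distilbert-base-uncased`, the `imdb` dataset, and the output paths are only placeholders for illustration):

```python
# train_ddp.py — minimal Trainer example; when launched with
# torchrun --nproc_per_node=<num_gpus>, the Trainer wraps the model in DDP automatically.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def main():
    model_name = "distilbert-base-uncased"  # any small model works for a demo
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # small public dataset so the example is self-contained
    dataset = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    tokenized = dataset.map(tokenize, batched=True)
    train_ds = tokenized["train"].shuffle(seed=42).select(range(2000))
    eval_ds = tokenized["test"].shuffle(seed=42).select(range(500))

    args = TrainingArguments(
        output_dir="./ddp_out",
        per_device_train_batch_size=16,   # batch size per GPU
        per_device_eval_batch_size=16,
        num_train_epochs=1,
        logging_steps=50,
        save_strategy="epoch",
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
    )
    trainer.train()
    trainer.save_model("./ddp_out/final")

if __name__ == "__main__":
    main()
```

And I would launch it on, say, 4 GPUs of a single node with:

```
torchrun --nproc_per_node=4 train_ddp.py
```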

What if I’m using torchrun — does that still work? My understanding is that with torchrun I have to explicitly check local_rank == 0 before saving the model, otherwise every process tries to write it.
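This is the pattern I've been using when saving manually (again just a sketch, with a placeholder output path) — is this still necessary with the Trainer, or does `trainer.save_model()` already restrict the write to the main process?

```python
import os

# torchrun sets the LOCAL_RANK environment variable for each process
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# ... training loop ...

if local_rank == 0:
    # only the main process writes the checkpoint to disk
    model.save_pretrained("./output")
    tokenizer.save_pretrained("./output")
```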