How do you run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?
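
For reference, the kind of script being asked about might look roughly like the minimal sketch below. The model name, dataset, and hyperparameters are only placeholders, not an official example:

```python
# train_ddp.py -- a minimal sketch; swap in your own model, data, and settings.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def main():
    model_name = "distilbert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Placeholder dataset with "text"/"label" columns.
    raw = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

    train_ds = raw["train"].map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=16,  # per GPU; effective batch = 16 x world size
        num_train_epochs=1,
        logging_steps=50,
        save_strategy="epoch",
    )

    # When launched with torchrun, the Trainer reads the distributed env vars
    # (RANK, LOCAL_RANK, WORLD_SIZE) and wraps the model in
    # DistributedDataParallel itself -- no manual DDP code is needed.
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()


if __name__ == "__main__":
    main()
```

On a single node with, say, 4 GPUs, this would be launched with `torchrun --nproc_per_node=4 train_ddp.py` (or `python -m torch.distributed.run ...` on older PyTorch versions).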

Yes, the Trainer does the same thing on the backend (and so does `save_state` in Accelerate): checkpoints are only written to and read from the main worker during saving.
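
So an explicit rank check is only needed for files you write yourself, outside of those helpers. A minimal sketch with Accelerate (the toy model and paths are placeholders):

```python
# Sketch only: toy model and paths are placeholders. Trainer.save_model() and
# accelerator.save_state() already coordinate the writing internally.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)  # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

# save_state checkpoints the model/optimizer/RNG state and handles which
# process writes what, so it needs no guard around it.
accelerator.save_state("ckpt")

# Only files you write yourself need an explicit main-process check:
if accelerator.is_main_process:
    with open("ckpt/notes.txt", "w") as f:
        f.write("written once, from the main process\n")
```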