How do I run an end-to-end example of distributed data parallel (DDP) with Hugging Face's Trainer API, ideally on a single node with multiple GPUs?

Do you mean raw PyTorch? torchrun is just the PyTorch equivalent of calling accelerate launch; it only handles spinning up the multi-GPU session and doesn't change anything about the code.
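
For reference, here is a minimal sketch of what a single-node multi-GPU Trainer run can look like. The model name (bert-base-uncased), the imdb dataset, and the hyperparameters are placeholders I picked for illustration, not anything from this thread. The key point is that the training script itself contains no DDP-specific code; the Trainer wraps the model in DistributedDataParallel automatically when the script is started by a distributed launcher.

```python
# train.py -- minimal Trainer script, unchanged between single-GPU and DDP runs.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def main():
    # Placeholder model/dataset; substitute your own.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    dataset = load_dataset("imdb", split="train[:2000]")
    dataset = dataset.map(
        lambda batch: tokenizer(
            batch["text"], truncation=True, padding="max_length", max_length=128
        ),
        batched=True,
    )

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,  # per GPU; effective batch = 8 * num_gpus
        num_train_epochs=1,
        logging_steps=50,
    )

    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()

if __name__ == "__main__":
    main()
```

Launching it on, say, 2 GPUs on one node is then just a matter of which launcher you prefer:

```bash
torchrun --nproc_per_node=2 train.py
# or equivalently
accelerate launch --num_processes 2 train.py
```

Either launcher starts one process per GPU and sets the distributed environment variables; the Trainer picks those up and handles the DDP wrapping and per-process data sharding itself.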