How to run single-node, multi-GPU training with HF Trainer?

Hi,

I want to run Trainer-based training scripts in a single-node, multi-GPU setting.
Do I need to launch the script with a torch launcher (torch.distributed, TorchX, torchrun, Ray Train, PyTorch Lightning, etc.), or can the HF Trainer use multiple GPUs on its own, without being launched by a third-party distributed launcher?
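For context, here is roughly what I mean by the two options; the script name train.py and the GPU count of 4 are just placeholders for my setup:

```bash
# Option 1: plain launch, no separate distributed launcher
python train.py

# Option 2: launched through a torch launcher such as torchrun,
# which starts one process per GPU
torchrun --nproc_per_node=4 train.py
```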


See the documentation on running scripts. :slight_smile:

I think the docs are insufficient. See my questions here: Using Transformers with DistributedDataParallel — any examples?


My impression of the HF Trainer is that HF has lots of video tutorials, but none of them covers multi-GPU training with the Trainer (presumably because it is assumed to be simple). Meanwhile the key element is missing from the docs: the command used to launch the Trainer script, which is really hard to find. So the easiest API is made hard to use by the omission of this launch command, which I finally found in one of the forum threads.
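For anyone who lands here later, the launch command I was missing looks roughly like the sketch below. It assumes a single node with 4 GPUs and a Trainer script named train.py; adjust --nproc_per_node and the script name/arguments to your own setup. As far as I understand, launching with plain python instead makes the Trainer fall back to torch.nn.DataParallel over all visible GPUs, while the commands below give you DistributedDataParallel:

```bash
# torchrun starts one process per GPU; the Trainer picks up the
# distributed environment it sets, so the training script itself
# needs no multi-GPU-specific code.
torchrun --nproc_per_node=4 train.py

# Older equivalent, deprecated in recent PyTorch releases:
python -m torch.distributed.launch --nproc_per_node=4 train.py
```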
