EDIT: I think the missing piece is the -m torch.distributed.launch option, which was absent from the terminal command. I will test this when I get a chance and update the thread if that’s the fix. I am fine-tuning GPT-2 using examples/language-modeling/run_clm.py. It seems like the Trainer class instantiated in it wil…
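
For anyone landing here, this is a rough sketch of what adding the launcher to the command might look like. It is not confirmed as the fix in this thread. The helper name build_launch_command, the GPU count of 4, and the script arguments are assumptions for illustration only; --nproc_per_node is the torch.distributed.launch option that sets how many processes to start on the node. Recent PyTorch releases also ship torchrun as the preferred replacement for this launcher.

```python
import shlex
import sys


def build_launch_command(script, num_gpus, script_args=()):
    """Build an argv list that starts `script` once per GPU through
    torch.distributed.launch (hypothetical helper, not part of any library)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        sys.executable,
        "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),  # one process per GPU on this node
        script,
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        "examples/language-modeling/run_clm.py",
        num_gpus=4,  # assumption: a single node with 4 GPUs
        # assumption: illustrative arguments, replace with your own
        script_args=["--model_name_or_path", "gpt2", "--do_train",
                     "--output_dir", "/tmp/clm-out"],
    )
    # Print the command so it can be copied into a terminal.
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch safe to try on a machine without GPUs; pass the list to subprocess.run if you want to launch it directly.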

LM example run_clm.py isn't distributing data across multiple GPUs as expected

brando August 17, 2022, 3:03pm 6

Topic	Category	Replies	Views	Activity
Running a Trainer in DistributedDataParallel mode	🤗Transformers	1	1466	October 24, 2020
Multi gpu training	🤗Transformers	3	6052	April 24, 2022
How to run an end to end example of distributed data parallel with hugging face's trainer api (ideally on a single node multiple gpus)?	Intermediate	17	18188	September 6, 2023
Distribute training	🤗Transformers	0	317	November 16, 2022
Why is Trainer only using 1 (not 4) GPUs?	Beginners	1	1649	June 2, 2022