The naive model parallelism in Transformer models such as T5 does not work with PyTorch DistributedDataParallel for multi-node training: the run freezes at the torch.nn.parallel.DistributedDataParallel(model) call and never returns. I followed this instruction to write the code. Has anyone run into a similar situation?
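Here is a minimal sketch of roughly what my script does; the model name and device map are placeholders for my actual setup, and the process group is assumed to be launched with torchrun so the usual env variables are set:

```python
import os
import torch
import torch.distributed as dist
from transformers import T5ForConditionalGeneration

# Initialize the process group for multi-node training
# (launched with torchrun, so RANK/WORLD_SIZE/LOCAL_RANK are set in the env).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

# Naive model parallelism: split the T5 blocks across the GPUs of this node.
# "t5-large" and the device map below are placeholders, not my real config.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
device_map = {0: list(range(0, 12)), 1: list(range(12, 24))}
model.parallelize(device_map)

# This is where it freezes on multi-node runs.
# device_ids is omitted because the module spans multiple devices.
ddp_model = torch.nn.parallel.DistributedDataParallel(model)
```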