Transformer model parallelism does not work with PyTorch DDP for multi-node training

The naive model parallelism in Transformer models such as T5 does not work with PyTorch DistributedDataParallel for multi-node training. The run freezes at the torch.nn.parallel.DistributedDataParallel(model) call. I followed these instructions to write the code. Has anyone run into a similar situation?
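
For context, here is a minimal sketch of roughly what my setup looks like (simplified; the model size, device map, and process-group settings below are placeholders, and the real script is launched with a multi-node launcher such as torchrun and contains the full training loop):

```python
import torch
import torch.distributed as dist
from transformers import T5ForConditionalGeneration

def main():
    # One process per node; env:// rendezvous (MASTER_ADDR etc. set by the launcher).
    dist.init_process_group(backend="nccl")

    model = T5ForConditionalGeneration.from_pretrained("t5-large")

    # Naive (layer-wise) model parallelism: split the encoder blocks across
    # two local GPUs. The exact device_map here is just an example.
    model.parallelize(device_map={0: list(range(12)), 1: list(range(12, 24))})

    # Multi-node runs hang at this wrapping step and never return.
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    # ... training loop using ddp_model ...

if __name__ == "__main__":
    main()
```

The same script completes the DDP wrapping step when run on a single node.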