The naive model parallelism in Transformer models such as T5 does not work with PyTorch DistributedDataParallel for multi-node training: the run freezes at the torch.nn.parallel.DistributedDataParallel(model) call and never returns. I followed this instruction to write the code. Has anyone run into a similar situation?
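Here is a minimal sketch of roughly what my script does; the model name and device map are placeholders for my actual setup, and the process group is assumed to be launched with torchrun so the usual env variables are set:

```python
import os
import torch
import torch.distributed as dist
from transformers import T5ForConditionalGeneration

# Initialize the process group for multi-node training
# (launched with torchrun, so RANK/WORLD_SIZE/LOCAL_RANK are set in the env).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

# Naive model parallelism: split the T5 blocks across the GPUs of this node.
# "t5-large" and the device map below are placeholders, not my real config.
model = T5ForConditionalGeneration.from_pretrained("t5-large")
device_map = {0: list(range(0, 12)), 1: list(range(12, 24))}
model.parallelize(device_map)

# This is where it freezes on multi-node runs.
# device_ids is omitted because the module spans multiple devices.
ddp_model = torch.nn.parallel.DistributedDataParallel(model)
```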