ZeRO 2 and 3 with Tensor Parallelism

Hi,

In DeepSpeed, ZeRO 2 is disabled when Pipeline Parallelism is used because of computational inefficiency: ZeRO 2 partitions gradients across ranks, while PP accumulates them across micro-batches. I believe ZeRO 3 and Tensor Parallelism are complementary, but I am unsure whether the same holds for ZeRO 2.

Has anyone observed similar inefficiencies, or any other issues, when using ZeRO 2 or 3 with Tensor Parallelism in accelerate?

Before I refactor a model to use Tensor Parallelism, I want to make sure it will still be fully compatible with ZeRO 2 or 3.

Thank you,

Enrico
