Hi,
In DeepSpeed, ZeRO 2 is disabled with Pipeline Parallelism (PP) because of computational inefficiencies: ZeRO 2 partitions gradients across ranks, while PP accumulates them across micro-batches. I believe ZeRO 3 and Tensor Parallelism are complementary, but I am unsure whether the same holds for ZeRO 2.
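To make the concern concrete, here is a toy sketch of how I understand the conflict. It is plain Python, not DeepSpeed code: the function names, the two-rank setup, and the gradient values are all made up for illustration, and the "collectives" are only simulated and counted. The assumption is that a rank holding only a gradient shard has to reduce after every micro-batch, whereas a rank holding full gradients can accumulate locally and reduce once per optimizer step.

```python
# Toy model only: simulated ranks and collectives, no real communication.

def reduce_scatter(per_rank_grads):
    """Sum gradients across ranks and hand each rank one contiguous shard."""
    world = len(per_rank_grads)
    n = len(per_rank_grads[0])
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    shard = n // world
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

def accumulate_then_reduce(micro_grads):
    """Each rank keeps full gradients and reduces once at the end."""
    world = len(micro_grads[0])
    n = len(micro_grads[0][0])
    full = [[0.0] * n for _ in range(world)]
    for step in micro_grads:
        for r in range(world):
            full[r] = [a + b for a, b in zip(full[r], step[r])]
    return reduce_scatter(full), 1  # one simulated collective per step

def reduce_every_micro_batch(micro_grads):
    """Each rank keeps only its gradient shard, so it reduces every micro-batch."""
    world = len(micro_grads[0])
    shard = len(micro_grads[0][0]) // world
    shards = [[0.0] * shard for _ in range(world)]
    calls = 0
    for step in micro_grads:
        reduced = reduce_scatter(step)
        calls += 1
        for r in range(world):
            shards[r] = [a + b for a, b in zip(shards[r], reduced[r])]
    return shards, calls

# micro_grads[micro_batch][rank] is that rank's local gradient (hypothetical values).
micro_grads = [
    [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]],
    [[0.5, 0.5, 0.5, 0.5], [1.5, 1.5, 1.5, 1.5]],
    [[2.0, 0.0, 2.0, 0.0], [0.0, 2.0, 0.0, 2.0]],
]

full_result, full_calls = accumulate_then_reduce(micro_grads)
shard_result, shard_calls = reduce_every_micro_batch(micro_grads)
print(full_result == shard_result, full_calls, shard_calls)
```

Both paths end with the same sharded gradients, but the shard-only path issues one simulated collective per micro-batch instead of one per step. That extra communication is the overhead I assume is behind the PP restriction.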
Has anyone observed similar inefficiencies, or any other issues, when using ZeRO 2 or 3 with Tensor Parallelism in accelerate?
Before I refactor a model to use Tensor Parallelism, I want to make sure it will remain fully compatible with ZeRO 2 or 3.
Thank you,
Enrico