The PipeDream-2BW paper and the ZeRO-Offload paper both show that a 1-step-delayed asynchronous gradient update does not hurt convergence (or perplexity), while improving training efficiency by a large margin (by fully utilizing the bubbles in pipeline parallelism).
However, neither Megatron-LM nor DeepSpeed uses the PipeDream-2BW schedule. Could anyone share some insights or ideas on why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer from convergence/accuracy issues in practice? Or are there other concerns blocking it from becoming the default / most popular pipeline-parallelism schedule?
(I posted the same question on Hacker News as well: Why async gradient update doesn't get popular in LLM community? | Hacker News)
I have tried implementing the PipeDream-2BW schedule on top of Megatron-LM and can reproduce both the performance gain and the loss convergence with GPT-2 345M on 8x V100 GPUs.
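For reference, this is the core idea I mean by a 1-step-delayed update with double-buffered weights. It's just my own toy illustration on a single stand-in stage, not PipeDream-2BW's or Megatron-LM's actual code:

```python
# Minimal sketch of the double-buffered, 1-step-delayed weight update idea
# behind PipeDream-2BW (illustration only, not the paper's actual code).
import copy
import torch

model = torch.nn.Linear(16, 16)          # stand-in for one pipeline stage
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Two weight versions: an in-flight micro-batch runs forward/backward on the
# version that existed when it entered the pipeline, even though the "main"
# weights may already have been updated by earlier micro-batches.
stale_model = copy.deepcopy(model)       # weight version W(t-1)

for step in range(4):
    x = torch.randn(8, 16)

    # Forward/backward against the (possibly stale) weight version.
    loss = stale_model(x).pow(2).mean()
    loss.backward()

    # Apply the gradient, which was computed w.r.t. W(t-1), to the up-to-date
    # weights W(t): this is the 1-step-delayed asynchronous update.
    with torch.no_grad():
        for p_new, p_old in zip(model.parameters(), stale_model.parameters()):
            p_new.grad = p_old.grad.clone()
    optimizer.step()
    optimizer.zero_grad()

    # Rotate the buffers: the freshly updated weights become the version that
    # the next in-flight micro-batches will use.
    stale_model.load_state_dict(model.state_dict())
    stale_model.zero_grad()
```

In the real schedule this buffering is what lets backward passes of older micro-batches overlap with forwards of newer ones instead of waiting for a synchronous flush, which is where the bubble reduction comes from.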