Why hasn't async gradient update become popular in the LLM community?

The PipeDream-2BW paper and the ZeRO-Offload paper both show that a 1-step-delayed asynchronous gradient update doesn't affect convergence (or perplexity), while improving training efficiency by a large margin (by fully utilizing the bubbles in pipeline parallelism).
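To make the "1-step-delayed update" idea concrete, here is a toy sketch (not the PipeDream-2BW implementation, just an illustration of the staleness pattern) comparing plain SGD with a variant that always applies the gradient computed from the previous step's weights, on a simple quadratic loss:

```python
# Toy illustration of one-step gradient staleness, the delay that
# PipeDream-2BW tolerates. Loss: f(w) = 0.5 * w**2, so grad(w) = w.

def sync_sgd(w, lr, steps):
    """Standard SGD: apply the gradient of the current weights."""
    for _ in range(steps):
        w = w - lr * w  # gradient of 0.5*w^2 is w
    return w

def delayed_sgd(w, lr, steps):
    """SGD with one-step-stale gradients: at step t, apply the
    gradient that was computed from the weights of step t-1."""
    stale_grad = w  # bootstrap with the gradient at the initial weights
    for _ in range(steps):
        new_grad = w            # gradient at the *current* weights...
        w = w - lr * stale_grad # ...but apply the one-step-stale one
        stale_grad = new_grad
    return w

if __name__ == "__main__":
    print(sync_sgd(1.0, 0.1, 200))
    print(delayed_sgd(1.0, 0.1, 200))
```

On this convex toy problem both variants converge to the same optimum; the papers' empirical claim is that the same robustness holds for LLM pretraining with a delay of only one step.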

However, neither Megatron-LM nor DeepSpeed uses PipeDream-2BW scheduling. Could anyone share some insights into why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer convergence/accuracy issues in practice? Or are there other concerns blocking it from becoming the default / most popular pipeline-parallelism schedule?

(I posted the same question on Hacker News as well: Why async gradient update doesn't get popular in LLM community? | Hacker News)

I have implemented the PipeDream-2BW scheduling scheme on Megatron-LM and can reproduce both the performance gain and the loss convergence with GPT-2 345M on 8×V100 GPUs.


Could you share your code?

You can find the implementation at: https://github.com/sighingnow/Megatron-LM/blob/ht/dev-pipe/megatron/core/pipeline_parallel/schedules.py#L1447

(I have rebased it against the latest Megatron-LM.)

Thank you so much!