The PipeDream-2BW paper and the ZeRO-Offload paper both show that a 1-step-delayed asynchronous gradient update does not hurt convergence (or perplexity), while improving training efficiency by a large margin (by fully utilizing the bubbles in pipeline parallelism).
However, neither Megatron-LM nor DeepSpeed uses the PipeDream-2BW schedule. Could anyone share some insights or ideas on why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer from convergence/accuracy issues in practice? Or are there other concerns blocking it from becoming the default / most popular pipeline-parallelism schedule?
(I posted the same question on Hacker News as well: Why async gradient update doesn't get popular in LLM community? | Hacker News)
I have tried implementing the PipeDream-2BW schedule on top of Megatron-LM and can reproduce both the performance gain and the loss convergence with GPT-2 345M on 8x V100 GPUs.
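For reference, this is the core idea I mean by a 1-step-delayed update with double-buffered weights. It's just my own toy illustration on a single stand-in stage, not PipeDream-2BW's or Megatron-LM's actual code:

```python
# Minimal sketch of the double-buffered, 1-step-delayed weight update idea
# behind PipeDream-2BW (illustration only, not the paper's actual code).
import copy
import torch

model = torch.nn.Linear(16, 16)          # stand-in for one pipeline stage
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Two weight versions: an in-flight micro-batch runs forward/backward on the
# version that existed when it entered the pipeline, even though the "main"
# weights may already have been updated by earlier micro-batches.
stale_model = copy.deepcopy(model)       # weight version W(t-1)

for step in range(4):
    x = torch.randn(8, 16)

    # Forward/backward against the (possibly stale) weight version.
    loss = stale_model(x).pow(2).mean()
    loss.backward()

    # Apply the gradient, which was computed w.r.t. W(t-1), to the up-to-date
    # weights W(t): this is the 1-step-delayed asynchronous update.
    with torch.no_grad():
        for p_new, p_old in zip(model.parameters(), stale_model.parameters()):
            p_new.grad = p_old.grad.clone()
    optimizer.step()
    optimizer.zero_grad()

    # Rotate the buffers: the freshly updated weights become the version that
    # the next in-flight micro-batches will use.
    stale_model.load_state_dict(model.state_dict())
    stale_model.zero_grad()
```

In the real schedule this buffering is what lets backward passes of older micro-batches overlap with forwards of newer ones instead of waiting for a synchronous flush, which is where the bubble reduction comes from.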