The PipeDream-2BW paper and the ZeRO-Offload paper both show that a 1-step delayed asynchronous gradient update doesn't affect convergence (or perplexity), while improving training efficiency by a large margin (by fully utilizing the bubbles in pipeline parallelism). However, both the Megatron-LM …

Why hasn't asynchronous gradient update become popular in the LLM community?
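
For anyone unfamiliar with the term, here is a minimal toy sketch of what a "1-step delayed" update means: the gradient applied at step t is computed from the weights of step t-1 rather than the current ones. This is only an illustration on a 1-D quadratic loss, not the implementation from either paper; the names `grad`, `train`, and `delay` and the loss itself are my own assumptions.

```python
def grad(w, target=3.0):
    # Gradient of the toy loss 0.5 * (w - target)**2 (assumed for illustration).
    return w - target


def train(steps=200, lr=0.1, delay=0, w0=0.0):
    """Plain gradient descent where the gradient is evaluated on stale weights.

    delay=0 is the usual synchronous update; delay=1 uses the weights
    from one step earlier, i.e. w[t+1] = w[t] - lr * grad(w[t-1]).
    """
    w = w0
    # Past weights, newest last; padded so a stale copy exists at step 0.
    history = [w0] * (delay + 1)
    for _ in range(steps):
        stale_w = history[-(delay + 1)]
        w = w - lr * grad(stale_w)
        history.append(w)
    return w


sync_w = train(delay=0)
delayed_w = train(delay=1)
print(sync_w, delayed_w)
```

On this toy problem both variants settle at the same minimum with a small learning rate. That says nothing about real LLM training, where the learning rate, optimizer state, and loss landscape all matter.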

TurboPascal October 13, 2023, 8:50am 4

Thank you so much
