When I train a model with transformers and enable FSDP together with gradient accumulation, I found that there is no reduce-scatter in the backward pass until the last gradient accumulation step. Does anybody know why?
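For reference, here is a minimal single-process sketch of the accumulation schedule being described. FSDP itself is omitted (it needs a multi-GPU process group), and the `opt.step()` point stands in for where the per-window reduce-scatter would fire; my understanding is that Trainer/Accelerate typically defer gradient sync with the model's `no_sync()` context during the non-final micro-steps, which would match the observed behavior:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4
sync_points = []  # micro-steps at which a sync/step would occur

for step, x in enumerate(torch.randn(8, 4)):
    loss = model(x).sum() / accum_steps  # scale so the accumulated grad matches the mean
    loss.backward()  # gradients add into .grad locally; no collective is needed yet
    if (step + 1) % accum_steps == 0:
        # Under FSDP via Trainer/Accelerate this is (presumably) where
        # no_sync() ends, so the reduce-scatter only appears here,
        # once per accumulation window.
        sync_points.append(step)
        opt.step()
        opt.zero_grad()

print(sync_points)  # two windows of 4 micro-batches
```

This reproduces the pattern in the question: communication (here, the stand-in sync point) happens only on the last micro-step of each window, not on every backward.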
Pinging @muellerzr here