Grad Accumulation in FSDP

When I train a model with transformers and enable both FSDP and gradient accumulation, I see no reduce-scatter during the backward pass until the last gradient accumulation step. Does anybody know why?
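
For context, this pattern looks like what PyTorch FSDP's `no_sync()` context manager does: inside it, gradients are accumulated locally without communication, and they are synchronized on the first backward pass after leaving the context. I am assuming (not confirmed here) that the training loop wraps every micro-step except the last one in `no_sync()`. Below is a minimal sketch of that control flow. `StubShardedModel`, `accumulation_context`, and `run` are hypothetical names for illustration only; the stub just records what would happen and does not call FSDP.

```python
from contextlib import contextmanager, nullcontext


class StubShardedModel:
    """Hypothetical stand-in for an FSDP-wrapped module; it only records events."""

    def __init__(self):
        self.sync_enabled = True
        self.log = []

    @contextmanager
    def no_sync(self):
        # Same shape as FSDP's no_sync(): gradient communication is off inside the block.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self):
        # A real FSDP backward would issue reduce-scatter here only when sync is enabled.
        self.log.append("reduce-scatter" if self.sync_enabled else "local-accumulate")


def accumulation_context(model, micro_step, accum_steps):
    """Return no_sync() for every micro-step except the last one in each window."""
    is_last = (micro_step + 1) % accum_steps == 0
    return nullcontext() if is_last else model.no_sync()


def run(num_micro_steps=4, accum_steps=4):
    model = StubShardedModel()
    for micro_step in range(num_micro_steps):
        with accumulation_context(model, micro_step, accum_steps):
            model.backward()
    return model.log
```

If the real loop follows this shape, seeing a reduce-scatter only on the last accumulation step would be expected rather than a bug. The trade-off is less communication at the cost of holding unsharded gradients in memory during the non-final micro-steps.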

Pinging @muellerzr here
