Training a Reformer model from scratch with DeepSpeed - backprop error

I'm training a Reformer model from scratch on custom data on SageMaker. My Reformer config is a variant of google/reformer-crime-and-punishment on the Hugging Face Hub.
I'm training with DeepSpeed (ZeRO Stage 0) on two host instances with 8 GPUs each.
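
Roughly, the setup looks like the sketch below (the overridden values and batch size are placeholders, not my exact config):

```python
from transformers import ReformerConfig, ReformerModelWithLMHead

# Start from the crime-and-punishment config and override a few fields
# (placeholder values shown here, my real config differs).
config = ReformerConfig.from_pretrained(
    "google/reformer-crime-and-punishment",
    vocab_size=32000,  # placeholder: size of my custom tokenizer
)
model = ReformerModelWithLMHead(config)

# DeepSpeed config: ZeRO Stage 0, i.e. plain data parallelism without sharding.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # placeholder
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True},
}
```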
During the first backprop, I keep seeing this error message:

[1,mpirank:0,algo-1]:Parameter at index 71 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
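
For the next run I plan to follow the hint in the message and enable the distributed debug flag to get the offending parameter's name. A minimal sketch of what I mean (the variable has to be set before torch.distributed is initialized, e.g. at the top of the training script or via the launcher/estimator environment):

```python
import os

# Ask PyTorch to print parameter names and extra collective info
# when a distributed error like the one above occurs.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"
```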

I've seen a few threads where this error shows up with gradient checkpointing enabled, but the Reformer model doesn't support gradient checkpointing, so I think I can rule that out. Any insights into what might be happening here would be welcome!
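
For reference, this is how I'm ruling out gradient checkpointing; just a quick check, assuming a transformers version recent enough to expose supports_gradient_checkpointing on model classes:

```python
from transformers import ReformerModelWithLMHead

# Reformer does not implement gradient checkpointing, so this prints False,
# which is why I think checkpointing can be excluded as the cause.
print(ReformerModelWithLMHead.supports_gradient_checkpointing)
```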