Training a Reformer model from scratch with DeepSpeed - backprop error

I'm training a Reformer model from scratch on custom data on SageMaker. My Reformer config is a variant of google/reformer-crime-and-punishment on the Hugging Face Hub.
I'm training with DeepSpeed (ZeRO Stage 0) on two host instances with 8 GPUs each.
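
Roughly, the setup looks like the sketch below (the overridden values and batch size are placeholders, not my exact config):

```python
from transformers import ReformerConfig, ReformerModelWithLMHead

# Start from the crime-and-punishment config and override a few fields
# (placeholder values shown here, my real config differs).
config = ReformerConfig.from_pretrained(
    "google/reformer-crime-and-punishment",
    vocab_size=32000,  # placeholder: size of my custom tokenizer
)
model = ReformerModelWithLMHead(config)

# DeepSpeed config: ZeRO Stage 0, i.e. plain data parallelism without sharding.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # placeholder
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True},
}
```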
During the first backprop, I keep seeing this error message:

[1,mpirank:0,algo-1]:Parameter at index 71 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
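
For the next run I plan to follow the hint in the message and enable the distributed debug flag to get the offending parameter's name. A minimal sketch of what I mean (the variable has to be set before torch.distributed is initialized, e.g. at the top of the training script or via the launcher/estimator environment):

```python
import os

# Ask PyTorch to print parameter names and extra collective info
# when a distributed error like the one above occurs.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"
```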

I've seen a few threads where this error shows up with gradient checkpointing enabled, but the Reformer model doesn't support gradient checkpointing, so I think I can rule that out. Any insights into what might be happening here would be welcome!
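
For reference, this is how I'm ruling out gradient checkpointing; just a quick check, assuming a transformers version recent enough to expose supports_gradient_checkpointing on model classes:

```python
from transformers import ReformerModelWithLMHead

# Reformer does not implement gradient checkpointing, so this prints False,
# which is why I think checkpointing can be excluded as the cause.
print(ReformerModelWithLMHead.supports_gradient_checkpointing)
```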