Resuming from a checkpoint breaks with DeepSpeed

Hi,
I am trying to continue training from a saved checkpoint when using DeepSpeed. I am using transformers 4.3.3.

Here is how I run the code. Since T5 pretraining is not yet in the HF repo, I wrote the training script myself, and I also modified the T5 model itself by adding some adapter layers inside the model's layers:

USE_TF=0 deepspeed run_mlm.py --model_name_or_path google/mt5-base --dataset_name opus100 --dataset_config_name de-en --do_train --do_eval --output_dir /user/dara/test --max_seq_length 128 --deepspeed ds_config.json --save_steps 10 --fp16
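
For reference, here is a minimal sketch of the kind of ds_config.json I pass via --deepspeed; the values below are illustrative placeholders, not my exact settings:

```python
# A minimal sketch of a ds_config.json of the kind passed via --deepspeed above.
# The values are illustrative placeholders, not the exact config from my run.
import json

ds_config = {
    "train_batch_size": 32,             # assumed; must match the effective batch size
    "gradient_accumulation_steps": 1,   # assumed
    "fp16": {"enabled": True},          # mirrors the --fp16 flag on the command line
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: optimizer state partitioning
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```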

Here is the error I get when trying to continue training from a checkpoint; the key it fails on is ‘exp_avg’. I should also add that without DeepSpeed I do not get this error. I greatly appreciate your input on this. I am really puzzled, and you are my only hope @stas

[2021-03-16 13:01:52,899] [INFO] [engine.py:1284:_load_checkpoint] rank: 0 loading checkpoint: /users/dara/test/checkpoint-20/global_step20/mp_rank_00_model_states.pt
successfully loaded 1 ZeRO state_dicts for rank 0
p  tensor([ 1.7500, -1.6719,  2.4062,  ..., -0.1953,  0.2002, -0.6484],
       requires_grad=True)  key  exp_avg  saved  torch.Size([15013760])  parameter shape  torch.Size([597396608])
Traceback (most recent call last):
  File "run_mlm.py", line 592, in <module>
    main()
  File "run_mlm.py", line 558, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/user/dara/dev/codes/seq2seq/third_party/trainers/trainer.py", line 780, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/user/dara/dev/codes/seq2seq/third_party/trainers/trainer.py", line 1169, in _load_optimizer_and_scheduler
    self.deepspeed.load_checkpoint(checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1265, in load_checkpoint
    load_optimizer_states=load_optimizer_states)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1337, in _load_zero_checkpoint
    load_from_fp32_weights=self.zero_load_from_fp32_weights())
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1822, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1783, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (597396608) must match the size of tensor b (15013760) at non-singleton dimension 0
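
As far as I can tell, DeepSpeed tries to copy the saved ‘exp_avg’ buffer into the current optimizer state with tensor.copy_(), which requires matching element counts. A tiny sketch of just that failing operation, with made-up sizes, looks like this:

```python
# Minimal sketch reproducing only the failing copy (sizes are made up and much
# smaller than the real ones): torch.Tensor.copy_() needs matching element counts.
import torch

current_state = torch.zeros(8)  # stands in for self.optimizer.state[p]["exp_avg"]
saved_state = torch.zeros(3)    # stands in for the "exp_avg" loaded from the checkpoint

current_state.data.copy_(saved_state.data)
# raises a RuntimeError complaining that the sizes do not match,
# just like the error in the traceback above
```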

I haven’t seen this type of error yet, but we will sort it out.

Could you please post this as an Issue? Meanwhile, I will try to reproduce it. Please tag @stas00 in the Issue.

Also, please make sure you use DeepSpeed master, as there are a lot of fixes in it.
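
For example, a quick sanity check (just a sketch) to confirm which versions your run actually picks up:

```python
# Quick sanity check: print the versions the current environment actually resolves,
# so you can confirm you are really running DeepSpeed master / a recent transformers.
import transformers
import deepspeed

print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
```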

When posting tracebacks, please always use multiline code formatting, which preserves newlines, so that it's possible to decipher them. Your formatting above is very difficult to read.

I edited your post to fix the formatting.

While you're filing the Issue, I have cleaned up the code, and you can check whether things work better for you with this PR: [DeepSpeed] improve checkpoint loading code plus tests by stas00 · Pull Request #10760 · huggingface/transformers

I haven't tested it with ZeRO-3 yet, only with ZeRO-2.

If it solves your problem, great: there is nothing else you need to do. If not, please do file an Issue as suggested in the comment above.

Dear Stas,
Thank you very much, I will file an Issue on this as you suggested. Sorry for the delay; I got a bit distracted by another problem, as mT5 gets NaN losses with DeepSpeed. If that is appropriate, I will open two separate Issues for these, following the structure you mentioned.
Thank you very much for all your kind help.

The PR I mentioned has already been merged, so please re-test your setup. Hopefully the problem has already been resolved.

I will create the NaN loss fixing PR shortly. Just need to sort out the tests.
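
In the meantime, if you want to narrow down where the NaNs first show up, here is a generic sketch (not the fix from the PR; register_nan_hooks is just a name I'm using here) that registers forward hooks and flags non-finite outputs:

```python
# Generic debugging aid (not the actual fix): register a forward hook on every
# submodule and report the modules whose outputs contain NaN/inf values.
import torch

def register_nan_hooks(model):
    def check(module, inputs, output):
        tensors = output if isinstance(output, (tuple, list)) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                print(f"non-finite values in the output of {module.__class__.__name__}")
    handles = [m.register_forward_hook(check) for m in model.modules()]
    return handles  # call .remove() on each handle to detach the hooks later
```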

Dear Stas,
I opened this Issue, and I truly appreciate your help: https://github.com/huggingface/transformers/issues/10821
I had to modify the mT5 model a bit by adding adapter layers, which are not yet integrated into the Hugging Face repo. I would be indebted to you if you could have a look and help me with this Issue.
I am finding it really hard to understand the checkpointing issue I encounter with DeepSpeed, and you are my only hope with it.
Thank you very much in advance.

I tested your NaN-loss PR. In my case, with adapters added to mT5 and DeepSpeed (the same code I shared in the opened Issue), I am still getting NaN losses. I would be indebted to you for advice on things I can try to resolve this.
I do not have access to newer GPUs, so I need to make DeepSpeed work. Thank you very much for all the incredible work you do.