Hi,
I am trying to continue training from a saved checkpoint when using DeepSpeed. I am using transformers 4.3.3.
Here is how I run the code. Since T5 pretraining is not yet added to the HF repo, I wrote the training script myself, and I also modified the T5 model itself by incorporating some adapter layers within the model layers:
USE_TF=0 deepspeed run_mlm.py --model_name_or_path google/mt5-base --dataset_name opus100 --dataset_config_name de-en --do_train --do_eval --output_dir /user/dara/test --max_seq_length 128 --deepspeed ds_config.json --save_steps 10 --fp16
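For context, my script resumes the same way the stock run_mlm.py example does. A minimal sketch of that resume logic (variable names like training_args, model_args, and trainer follow the HF example scripts; my actual script differs only in the adapter-related parts):

```python
import os
from transformers.trainer_utils import get_last_checkpoint

# Detect an existing checkpoint in the output dir and hand it back to the Trainer.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

checkpoint = None
if last_checkpoint is not None:
    checkpoint = last_checkpoint
elif os.path.isdir(model_args.model_name_or_path):
    checkpoint = model_args.model_name_or_path

train_result = trainer.train(resume_from_checkpoint=checkpoint)
```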
Here is the error I get when trying to continue training from a checkpoint; the key this happens for is 'exp_avg'. I should also add that without DeepSpeed I do not get this error. I would greatly appreciate your input on this. I am really puzzled by it and you are my only hope @stas. Thank you so much.
[2021-03-16 13:01:52,899] [INFO] [engine.py:1284:_load_checkpoint] rank: 0 loading checkpoint: /users/dara/test/checkpoint-20/global_step20/mp_rank_00_model_states.pt
successfully loaded 1 ZeRO state_dicts for rank 0
p tensor([ 1.7500, -1.6719, 2.4062, ..., -0.1953, 0.2002, -0.6484],
requires_grad=True) key exp_avg saved torch.Size([15013760]) parameter shape torch.Size([597396608])
Traceback (most recent call last):
  File "run_mlm.py", line 592, in <module>
    main()
  File "run_mlm.py", line 558, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/user/dara/dev/codes/seq2seq/third_party/trainers/trainer.py", line 780, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/user/dara/dev/codes/seq2seq/third_party/trainers/trainer.py", line 1169, in _load_optimizer_and_scheduler
    self.deepspeed.load_checkpoint(checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1265, in load_checkpoint
    load_optimizer_states=load_optimizer_states)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1337, in _load_zero_checkpoint
    load_from_fp32_weights=self.zero_load_from_fp32_weights())
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1822, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/user/dara/libs/anaconda3/envs/fast/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1783, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (597396608) must match the size of tensor b (15013760) at non-singleton dimension 0
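If it helps pinpoint it, the failure is the element-count mismatch in the copy_ call at the bottom of the trace: the ZeRO-2 restore copies the saved 'exp_avg' state (15,013,760 elements) into the current optimizer state tensor (597,396,608 elements), and copy_ requires matching sizes. A toy illustration of the same failure (the sizes below are just small placeholders for the real ones):

```python
import torch

# Stand-ins for the current optimizer state tensor and the exp_avg tensor
# saved in the checkpoint; the real sizes are 597,396,608 vs 15,013,760.
current_state = torch.zeros(8)
saved_exp_avg = torch.zeros(3)

# Same operation as in DeepSpeed's _restore_base_optimizer_state:
#   self.optimizer.state[p][key].data.copy_(saved.data)
current_state.data.copy_(saved_exp_avg.data)
# RuntimeError: The size of tensor a (8) must match the size of tensor b (3)
# at non-singleton dimension 0
```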