@stas Excuse me, I’m using DeepSpeed for multi-node pre-training. My training script:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=4 \
--node_rank=... \
--master_addr=... \
--master_port=19500 \
--use_env \
transformers/examples/pytorch/language-modeling/run_mlm.py \
--resume_from_checkpoint $CKPT_TO_RESUME \
--config_name $MODEL \
--tokenizer_name $TOKENIZER \
--dataset_name $PT_DATASET \
--max_steps $MAX_STEPS \
...
--deepspeed config.json
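For context, the config I pass with --deepspeed is along these lines (an abridged, illustrative sketch rather than my exact file; ZeRO stage 2 and the "auto" values are just the pattern the HF Trainer integration documents, where "auto" fields get resolved from the training arguments at runtime):

# illustrative, abridged DeepSpeed config; the real file has more fields
cat > config.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2
  },
  "fp16": {
    "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF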
After training, these are the files in my checkpoint dir:
config.json
global_step1500
latest
merges.txt
optimizer.pt
pytorch_model.bin
rng_state_0.pth ... rng_state_31.pth (32 files)
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer_config.json
trainer_state.json
training_args.bin
vocab.json
zero_to_fp32.py
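In case it’s relevant: as I understand it, with ZeRO the partitioned model/optimizer states live under global_step1500, and the bundled zero_to_fp32.py can consolidate them into a single fp32 state dict (usage per the script’s --help):

# run from inside the checkpoint directory ($CKPT_DIR is a placeholder);
# writes a consolidated fp32 state dict loadable via from_pretrained()
cd $CKPT_DIR
python zero_to_fp32.py . pytorch_model.bin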
Then I run run_glue.sh on this checkpoint, but the scores come out the same as without pre-training. Am I doing something wrong? Many thanks for any help!
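P.S. For reference, my run_glue.sh is essentially a wrapper around run_glue.py along these lines (a sketch with abridged flags; $PRETRAINED_CKPT stands for the checkpoint dir above, and mrpc is just an example task):

# hypothetical sketch; paths and task name are placeholders
python transformers/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path $PRETRAINED_CKPT \
--task_name mrpc \
--do_train \
--do_eval \
--output_dir glue_out \
...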