DeepSpeed Further Training Issue

@stas Excuse me, I’m using DeepSpeed for multi-node pre-training. My training script:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=... \
    --master_addr=... \
    --master_port=19500 \
    --use_env \
    transformers/examples/pytorch/language-modeling/ \
    --resume_from_checkpoint $CKPT_TO_RESUME \
    --config_name $MODEL \
    --tokenizer_name $TOKENIZER \
    --dataset_name $PT_DATASET \
    --max_steps $MAX_STEPS \
    --deepspeed config.json

After training I get these files in my ckpt dir:

config.json rng_state_10.pth rng_state_17.pth rng_state_23.pth rng_state_2.pth rng_state_7.pth trainer_state.json
global_step1500 rng_state_11.pth rng_state_18.pth rng_state_24.pth rng_state_30.pth rng_state_8.pth training_args.bin
latest rng_state_12.pth rng_state_19.pth rng_state_25.pth rng_state_31.pth rng_state_9.pth vocab.json
merges.txt rng_state_13.pth rng_state_1.pth rng_state_26.pth rng_state_3.pth rng_state_14.pth rng_state_20.pth rng_state_27.pth rng_state_4.pth special_tokens_map.json
pytorch_model.bin rng_state_15.pth rng_state_21.pth rng_state_28.pth rng_state_5.pth tokenizer_config.json
rng_state_0.pth rng_state_16.pth rng_state_22.pth rng_state_29.pth rng_state_6.pth tokenizer.json

Then I evaluate this ckpt, but the scores are identical to those from before pre-training. Am I doing anything wrong? Many thanks for any help!

Apologies for not replying sooner. Since I left HF, my account has been stuck with a disabled email address that I can’t change, so I get no notifications when someone tags me.

It looks like it’s perhaps not loading the custom checkpoint and is loading the default model instead.
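
A quick way to rule out a broken checkpoint directory before looking at logs is sketched below. It only uses the file names visible in the listing above (latest, trainer_state.json, pytorch_model.bin) and assumes that latest holds the DeepSpeed tag name and that trainer_state.json has a global_step field. The helper name summarize_checkpoint is made up for illustration, and it cannot tell you whether the evaluation run actually loaded these files.

```python
import json
from pathlib import Path


def summarize_checkpoint(ckpt_dir):
    """Collect a few facts about a Trainer + DeepSpeed checkpoint directory.

    Hypothetical helper: it only inspects files on disk and does not load
    any weights.
    """
    ckpt = Path(ckpt_dir)
    summary = {
        "tag": None,             # contents of the `latest` file, if present
        "tag_dir_exists": False,  # whether the directory named by the tag exists
        "global_step": None,      # step recorded in trainer_state.json
        "has_model_bin": (ckpt / "pytorch_model.bin").is_file(),
    }

    latest = ckpt / "latest"
    if latest.is_file():
        tag = latest.read_text().strip()
        summary["tag"] = tag
        summary["tag_dir_exists"] = (ckpt / tag).is_dir()

    state = ckpt / "trainer_state.json"
    if state.is_file():
        summary["global_step"] = json.loads(state.read_text()).get("global_step")

    return summary
```

If the tag directory is missing or global_step is not what you expect (1500 in the listing above), the problem is on the saving side rather than the loading side.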

Enabling a higher logging level like INFO is likely to log which model path is getting loaded.
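
For reference, a minimal sketch of raising the transformers log level from Python. It assumes you control the entry-point script. With the Trainer-based example scripts, passing --log_level info on the command line should have a similar effect. The try/except is only there so the snippet also runs where transformers is not installed.

```python
import os

# Set this before `transformers` is imported: the library reads the variable
# when it first configures its root logger.
os.environ["TRANSFORMERS_VERBOSITY"] = "info"

try:
    from transformers.utils import logging as hf_logging

    # Explicitly raise the library's verbosity to INFO as well.
    hf_logging.set_verbosity_info()
except ImportError:
    # transformers is not installed in this environment; nothing more to do.
    hf_logging = None
```

In a multi-node launch, each process needs the setting, so exporting TRANSFORMERS_VERBOSITY=info in the shell on every node is usually simpler than editing the script.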

But it’s also possible that your training (fine-tuning, I assume) simply didn’t improve on the original baseline.

If you’re still stuck, it’s probably best to ask in the transformers Issues and provide more information, as I suggested above, since what you shared here is insufficient to come to any conclusions.

I tested it out, and it seems to have been a problem with the DeepSpeed config. After adjusting the config and resolving its conflicts with the launch script’s settings, it works. Thanks so much for your reply!
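
For anyone hitting something similar: the post does not say which settings conflicted, so the following is only a sketch of how one might look for such mismatches. It compares a few DeepSpeed config keys with the Trainer arguments they correspond to in the Hugging Face integration, skipping values set to "auto". The function names and the particular selection of keys are illustrative assumptions, not an exhaustive list.

```python
# DeepSpeed config key (dotted path) -> corresponding Trainer argument.
# Illustrative subset only.
PAIRS = {
    "train_micro_batch_size_per_gpu": "per_device_train_batch_size",
    "gradient_accumulation_steps": "gradient_accumulation_steps",
    "gradient_clipping": "max_grad_norm",
    "optimizer.params.lr": "learning_rate",
    "fp16.enabled": "fp16",
}


def lookup(config, dotted_key):
    """Return the value at a dotted path in a nested dict, or None if absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def find_conflicts(ds_config, trainer_args):
    """List (ds_key, ds_value, arg_name, arg_value) tuples that disagree."""
    conflicts = []
    for ds_key, arg_name in PAIRS.items():
        ds_value = lookup(ds_config, ds_key)
        # Missing keys and "auto" placeholders cannot conflict.
        if ds_value is None or ds_value == "auto" or arg_name not in trainer_args:
            continue
        if ds_value != trainer_args[arg_name]:
            conflicts.append((ds_key, ds_value, arg_name, trainer_args[arg_name]))
    return conflicts


# Made-up values for illustration.
example_ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": "auto",
    "optimizer": {"params": {"lr": 1e-4}},
}
example_args = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
}
example_conflicts = find_conflicts(example_ds_config, example_args)
```

Here example_conflicts contains only the batch-size mismatch (4 in the config versus 8 on the command line). Setting the config value to "auto" is one way to let the launch script's value win.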