DeepSpeed Further Training Issue

JackBAI · October 13, 2023, 7:50pm

@stas Excuse me, I’m using DeepSpeed for multi-node pre-training. My training script:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=... \
    --master_addr=... \
    --master_port=19500 \
    --use_env \
    transformers/examples/pytorch/language-modeling/run_mlm.py \
    --resume_from_checkpoint $CKPT_TO_RESUME \
    --config_name $MODEL \
    --tokenizer_name $TOKENIZER \
    --dataset_name $PT_DATASET \
    --max_steps $MAX_STEPS \
    ...
    --deepspeed config.json

After training I get these files in my ckpt dir:

config.json rng_state_10.pth rng_state_17.pth rng_state_23.pth rng_state_2.pth rng_state_7.pth trainer_state.json
global_step1500 rng_state_11.pth rng_state_18.pth rng_state_24.pth rng_state_30.pth rng_state_8.pth training_args.bin
latest rng_state_12.pth rng_state_19.pth rng_state_25.pth rng_state_31.pth rng_state_9.pth vocab.json
merges.txt rng_state_13.pth rng_state_1.pth rng_state_26.pth rng_state_3.pth scheduler.pt zero_to_fp32.py
optimizer.pt rng_state_14.pth rng_state_20.pth rng_state_27.pth rng_state_4.pth special_tokens_map.json
pytorch_model.bin rng_state_15.pth rng_state_21.pth rng_state_28.pth rng_state_5.pth tokenizer_config.json
rng_state_0.pth rng_state_16.pth rng_state_22.pth rng_state_29.pth rng_state_6.pth tokenizer.json

Then I ran run_glue.sh on this ckpt, but the scores are the same as before pre-training. Am I doing anything wrong? Many thanks for any help!

stas · November 11, 2023, 2:56am

Apologies for not replying sooner. Since I left HF, my account has been stuck with a disabled email at hf.co that I can’t change, so I get no notifications when someone tags me.

It looks like it’s perhaps not loading the custom checkpoint and is loading the default model instead.

Enabling a higher logging level like INFO is likely to log which model path is getting loaded.
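
A quick way to rule out a wrong or incomplete checkpoint path before rerunning can be sketched as follows. This is a minimal standard-library sketch, not part of transformers or DeepSpeed: `inspect_checkpoint` is a hypothetical helper, and the file names it looks for are taken from the directory listing earlier in this thread. It assumes the `latest` file holds the tag of the most recent DeepSpeed save.

```python
from pathlib import Path


def inspect_checkpoint(ckpt_dir):
    """Summarize what a Trainer + DeepSpeed checkpoint directory contains.

    Hypothetical helper for illustration only; the file names come from
    the directory listing posted in this thread.
    """
    root = Path(ckpt_dir)
    latest_file = root / "latest"
    # DeepSpeed records the tag of its most recent save in `latest`.
    tag = latest_file.read_text().strip() if latest_file.is_file() else None
    weights = root / "pytorch_model.bin"
    return {
        "exists": root.is_dir(),
        "has_config": (root / "config.json").is_file(),
        "has_weights": weights.is_file(),
        "weights_bytes": weights.stat().st_size if weights.is_file() else 0,
        "latest_tag": tag,
        # True only when the tagged DeepSpeed state directory is present.
        "tag_dir_exists": tag is not None and (root / tag).is_dir(),
    }
```

Calling `inspect_checkpoint(...)` on the exact directory handed to the fine-tuning script shows whether that path exists and whether `pytorch_model.bin` has a plausible size for the model. Under ZeRO stage 3 the `pytorch_model.bin` written by the Trainer may not hold the full weights unless the config gathers them on save, which is what the `zero_to_fp32.py` script in the listing is for. Whether that applies here is not stated in the thread.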

But it’s also possible that your training (I assume fine-tuning) simply didn’t improve on the original baseline.

If you’re still stuck, it’s probably best to ask in the transformers Issues and provide more information, as I suggested above. What you shared here is insufficient to come to any conclusions.

JackBAI · November 25, 2023, 2:38pm

I tested it and it seems to have been a problem with the DeepSpeed config. After adjusting the config and resolving its conflicts with the launch script’s settings, it works. Thanks so much for your reply!
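
The thread does not say which settings conflicted. Purely as an illustration, the sketch below compares a DeepSpeed config with command-line training arguments for a few values that the Hugging Face Trainer integration expects to agree. `find_conflicts`, `SHARED_SETTINGS`, and the example values are hypothetical; the key names follow the DeepSpeed config schema and `TrainingArguments`. Entries set to `"auto"` in the DeepSpeed config are skipped, since the Trainer fills those in from its own arguments.

```python
# Pairs of (path into the DeepSpeed config, TrainingArguments name).
# A small illustrative subset, not an exhaustive list.
SHARED_SETTINGS = [
    (("train_micro_batch_size_per_gpu",), "per_device_train_batch_size"),
    (("gradient_accumulation_steps",), "gradient_accumulation_steps"),
    (("optimizer", "params", "lr"), "learning_rate"),
]

_MISSING = object()


def _lookup(config, path):
    """Follow a key path through nested dicts; return _MISSING if absent."""
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def find_conflicts(ds_config, trainer_args):
    """Return (ds_key, ds_value, arg_name, arg_value) for each mismatch."""
    conflicts = []
    for path, arg_name in SHARED_SETTINGS:
        ds_value = _lookup(ds_config, path)
        # Skip values that are absent, deferred to the Trainer, or not passed.
        if ds_value is _MISSING or ds_value == "auto" or arg_name not in trainer_args:
            continue
        if ds_value != trainer_args[arg_name]:
            conflicts.append((".".join(path), ds_value, arg_name, trainer_args[arg_name]))
    return conflicts


# Hypothetical values, chosen only to show one mismatch.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": "auto",
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
trainer_args = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
}
conflicts = find_conflicts(ds_config, trainer_args)
```

Here only the batch size differs, so `conflicts` holds a single entry for `train_micro_batch_size_per_gpu`. Setting such values to `"auto"` in the DeepSpeed config is one way to keep the two sources from disagreeing.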
