DeepSpeed and Trainer do not exit after training is completed

Hi there,

I am fine-tuning a RoBERTa model on a question-answering task using SQuAD. I used the scripts given here.

I am using DeepSpeed ZeRO stage 1 and I have 4 GPUs.

The command I ran was:

MODEL_PATH="/grand/SolarWindowsADSP/dingyun/optoelectronics/training/development/dapt_roberta/model.checkpoints"
CHECKPOINT_NAME="modelPBS.$RUN_NAME.checkpoint"

deepspeed \
    --hostfile=${HOSTFILE} --no_local_rank run_qa.py \
    --model_name_or_path "$MODEL_PATH/$CHECKPOINT_NAME" \
    --config_name "FacebookAI/roberta-base" \
    --tokenizer_name "FacebookAI/roberta-base" \
    --token "$READ_TOKEN" \
    --dataset_name squad \
    --report_to "wandb" \
    --do_train \
    --do_eval \
    --learning_rate 4e-5 \
    --warmup_ratio 0.2 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --doc_stride 128 \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 4 \
    --evaluation_strategy "steps" \
    --eval_steps 300 \
    --eval_accumulation_steps 10 \
    --logging_strategy steps \
    --logging_steps 50 \
    --load_best_model_at_end \
    --metric_for_best_model "exact_match" \
    --output_dir "$OUTDIR" \
    --overwrite_output_dir \
    --run_name "$RUN_NAME" \
    --deepspeed ds_config_zero1.json \
    --seed 50

Training, evaluation, and checkpoint saving all work normally, but the process hangs after training finishes.
The final lines of the log output are:

Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2521] 2024-04-30 22:44:33,057 >> Loading best model from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700 (score: 85.17502365184484).
[INFO|deepspeed.py:430] 2024-04-30 22:44:33,059 >> Attempting to resume from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700
[2024-04-30 22:44:33,061] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/mp_rank_00_model_states.pt...
[2024-04-30 22:44:33,129] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/mp_rank_00_model_states.pt.
[2024-04-30 22:44:33,130] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/mp_rank_00_model_states.pt...
[2024-04-30 22:44:33,212] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/mp_rank_00_model_states.pt.
[2024-04-30 22:44:33,248] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-04-30 22:44:33,290] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /grand/SolarWindowsADSP/dingyun/optoelectronics/training/qa-from-checkpoints/roberta-base/epoch5/checkpoint-2700/global_step2700/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-04-30 22:44:33,290] [INFO] [engine.py:3023:_get_all_zero_checkpoint_state_dicts] successfully read 4 ZeRO state_dicts for rank 0
[2024-04-30 22:44:33,309] [INFO] [engine.py:2973:_load_zero_checkpoint] loading 4 zero partition checkpoints for rank 0

Any help is appreciated!

This worked for me:

https://github.com/microsoft/DeepSpeed/issues/5205#issuecomment-2023038565

rm -rf ~/.triton
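
If you would rather not delete the directory outright, a more cautious variant is to move it aside first, so it can be restored if clearing it turns out not to help. This is only a sketch: the function name backup_cache_dir is made up for illustration, and the only thing taken from the fix above is the ~/.triton path. I am assuming nothing else needs to be cleared.

```shell
# Hypothetical helper (not part of DeepSpeed or Triton): move a cache
# directory aside instead of deleting it, so it can be restored later.
# Defaults to ~/.triton, the path used in the command above.
backup_cache_dir() {
    cache_dir="${1:-$HOME/.triton}"
    if [ ! -d "$cache_dir" ]; then
        # Nothing to move; report it and treat this as success.
        echo "no cache directory at $cache_dir"
        return 0
    fi
    # A timestamp suffix keeps repeated backups from colliding.
    backup="${cache_dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$cache_dir" "$backup" && echo "moved $cache_dir to $backup"
}
```

Call backup_cache_dir with no arguments before relaunching the job. If the run then exits cleanly, the backup directory can be removed; if not, move it back and look elsewhere.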