Eval freezes on local multi-GPU DeepSpeed run

Environment info

  • transformers version: 4.6.0.dev0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: <2,4>
  • Using distributed or parallel set-up in script?:

Information

I’m working on wav2vec 2.0 using the official Hugging Face fine-tuning script, run_common_voice.py.

I am trying to fine-tune the model on multiple GPUs using DeepSpeed.

deepspeed --num_gpus=1 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

works, but

deepspeed --num_gpus=2 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

freezes at the end of evaluation: the progress bar reaches 100%, but the eval result is never returned.

To reproduce

This is how to reproduce!

Steps to reproduce the behavior:

  1. Install deepspeed
  2. Add with autocast(): after line 481 in run_common_voice.py (see the sketch after this list)
  3. Set param: --deepspeed ds_config.json --do_train --do_eval
  4. Run run_common_voice.py using deepspeed with more than 1 GPU
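
For reference, a minimal sketch of what the modification in step 2 might look like, assuming the forward pass / loss computation in the trainer's training step is what gets wrapped (the exact code around line 481 may differ):

from torch.cuda.amp import autocast

def training_step_with_autocast(model, inputs):
    # Wrap the forward pass and loss computation in autocast(), as described in step 2.
    # Note: the reply further down explains why this conflicts with DeepSpeed's own fp16 handling.
    with autocast():
        outputs = model(**inputs)
        loss = outputs.loss
    return loss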

ds_config has the following parameters.

{
  "fp16": {
    "enabled": "true",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": "false"
}

Expected behavior

The finetuning eval should be executed without freezing.


cc @stas for DeepSpeed.


DeepSpeed doesn’t work with autocast; it has its own way of dealing with mixed precision. If you look in trainer.py, you’ll see autocast is carefully bypassed when DeepSpeed is enabled.
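
A minimal sketch of the pattern being described, assuming a simplified training step (this is not the actual trainer.py code, which has more machinery around it):

from torch.cuda.amp import autocast

def training_step(model, inputs, use_amp=False, deepspeed_engine=None):
    if deepspeed_engine is not None:
        # When DeepSpeed drives training, skip autocast entirely: the engine applies
        # fp16 casting and loss scaling itself, based on the fp16 block in ds_config.json.
        loss = model(**inputs).loss
        deepspeed_engine.backward(loss)
    elif use_amp:
        # Native AMP path (no DeepSpeed): autocast is appropriate here.
        with autocast():
            loss = model(**inputs).loss
        loss.backward()
    else:
        loss = model(**inputs).loss
        loss.backward()
    return loss.detach()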

If the problem persists after removing autocast, please let’s use a GitHub Issue for debugging, so it’s easy to track.

Edit: I see it was already filed: [wav2vec] deepspeed eval bug in the case of >1 gpus · Issue #11446 · huggingface/transformers · GitHub

Thank you.


Thanks for replying here and on GitHub! I’ll follow up on the issue.