Environment info
- transformers version: 4.6.0.dev0
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (True)
- Tensorflow version (GPU?): 2.4.1 (True)
- Using GPU in script?: <2,4>
- Using distributed or parallel set-up in script?: yes (DeepSpeed)
Information
I’m working on wav2vec 2.0 using Hugging Face’s official example script run_common_voice.py.
I am trying to fine-tune the model on multiple GPUs using DeepSpeed.
```
deepspeed --num_gpus=1 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval
```
works, but
```
deepspeed --num_gpus=2 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval
```
hangs at the end of evaluation: the progress bar reaches 100%, but the eval results are never returned and the process freezes.
To reproduce
Steps to reproduce the behavior:
- Install deepspeed
- Add `with autocast():` after line 481 in run_common_voice.py (a sketch of the modified code follows this list)
- Set params: `--deepspeed ds_config.json --do_train --do_eval`
- Run run_common_voice.py with deepspeed and more than one GPU
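
For context, the change wraps the forward pass in PyTorch's native AMP context. This is only a sketch: the surrounding method (`training_step` on the example's `CTCTrainer`) and the exact code at line 481 are assumptions and may differ between versions of the script.
```python
import torch
from torch.cuda.amp import autocast

def training_step_forward(model: torch.nn.Module, inputs: dict) -> torch.Tensor:
    # Inside CTCTrainer.training_step (around line 481): wrap the
    # forward pass in the mixed-precision context so activations are
    # computed in fp16 where safe.
    with autocast():
        outputs = model(**inputs)
    return outputs.loss
```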
ds_config.json contains the following parameters:
```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O3"
    },
    "steps_per_print": 100,
    "wall_clock_breakdown": false
}
```
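
A quick way to confirm the config is well-formed (a minimal sanity-check sketch, not part of the training setup): the toggles must be JSON booleans rather than the strings "true"/"false", since any non-empty string, including "false", is truthy in Python.
```python
import json

# Sanity-check ds_config.json: it must parse as JSON, and the on/off
# switches should be booleans. Quoting them as "true"/"false" can
# silently flip a setting, because both strings are truthy.
with open("ds_config.json") as f:
    cfg = json.load(f)

assert isinstance(cfg["fp16"]["enabled"], bool)
assert isinstance(cfg["wall_clock_breakdown"], bool)
```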
Expected behavior
Evaluation during fine-tuning should run to completion and return results without freezing, just as it does on a single GPU.