Eval freezes on local multi GPU Deepspeed run

qqhann · April 26, 2021, 9:05am

Environment info

transformers version: 4.6.0.dev0
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
PyTorch version (GPU?): 1.8.1+cu101 (True)
Tensorflow version (GPU?): 2.4.1 (True)
Using GPU in script?: <2,4>
Using distributed or parallel set-up in script?:

Information

I’m working on wav2vec 2.0 using the following official Hugging Face example script.

huggingface/transformers/blob/main/examples/research_projects/wav2vec2/run_common_voice.py

#!/usr/bin/env python3
import json
import logging
import os
import re
import sys
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

import datasets
import numpy as np
import torch
import torchaudio
from packaging import version
from torch import nn

import transformers
from transformers import (
    HfArgumentParser,
    Trainer,

This file has been truncated.

I am trying to fine-tune a Hugging Face model on multiple GPUs using DeepSpeed.

deepspeed --num_gpus=1 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

works, but

deepspeed --num_gpus=2 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

freezes at the end of eval.
The progress bar reaches 100%, but the eval result is never returned and the process hangs.

To reproduce

This is how to reproduce!

Steps to reproduce the behavior:

1. Install deepspeed
2. Add `with autocast():` after line 481 in run_common_voice.py
3. Set the params: --deepspeed ds_config.json --do_train --do_eval
4. Run run_common_voice.py using deepspeed with more than one GPU

ds_config has the following parameters.

{
  "fp16": {
    "enabled": "true",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": "false"
}
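As a side check on the config above, here is a minimal sketch that lists any values written as the strings "true" or "false" rather than JSON booleans. The DeepSpeed config documentation describes `fp16.enabled` and `wall_clock_breakdown` as booleans, so quoted values may not be read the way they look. The helper name `find_string_booleans` is made up for this illustration, and nothing here is claimed to be the cause of the freeze.

```python
import json

# The config from this post, copied verbatim.
DS_CONFIG_TEXT = """
{
  "fp16": {
    "enabled": "true",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": "false"
}
"""


def find_string_booleans(node, path=""):
    """Return dotted paths whose value is the string "true" or "false".

    Hypothetical helper for illustration; it is not part of DeepSpeed.
    """
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            hits.extend(find_string_booleans(value, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits.extend(find_string_booleans(value, f"{path}[{i}]"))
    elif isinstance(node, str) and node.lower() in ("true", "false"):
        hits.append(path)
    return hits


config = json.loads(DS_CONFIG_TEXT)
suspect = find_string_booleans(config)
print(suspect)
```

If the list is non-empty, rewriting those entries as bare `true` / `false` removes one variable before debugging the hang itself.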

Expected behavior

The finetuning eval should be executed without freezing.

qqhann · April 26, 2021, 9:20am

This is how to reproduce!

sgugger · April 26, 2021, 2:36pm

cc @stas for deepspeed.

stas · April 26, 2021, 4:53pm

deepspeed doesn’t work with autocast. It has its own way of dealing with mixed precision; if you look in trainer.py, you’ll see that autocast is carefully bypassed when deepspeed is used.
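A rough sketch of that idea, assuming a script that wraps its own forward pass: only enter `autocast` when DeepSpeed is not in charge of mixed precision. The function `forward_context` and the flags `use_deepspeed` and `use_amp` are hypothetical names for this illustration, not the actual Trainer code.

```python
import contextlib


def forward_context(use_deepspeed: bool, use_amp: bool):
    """Pick the context manager to wrap the forward pass in.

    Hypothetical helper: both flags are assumptions made for this sketch.
    """
    if use_deepspeed or not use_amp:
        # DeepSpeed handles fp16 itself (via ds_config.json), so add nothing here.
        return contextlib.nullcontext()
    # Imported lazily so the DeepSpeed path never touches torch AMP.
    from torch.cuda.amp import autocast

    return autocast()


# Intended usage inside a training or eval step:
#     with forward_context(use_deepspeed=True, use_amp=True):
#         loss = model(**inputs).loss
ctx = forward_context(use_deepspeed=True, use_amp=True)
print(type(ctx).__name__)
```

With DeepSpeed enabled this returns a no-op context, which has the same effect as deleting the manually added `with autocast():` line.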

If the problem persists after removing autocast, please let’s use an Issue for debugging, so it’s easy to track.

Edit: I see it was already filed: [wav2vec] deepspeed eval bug in the case of >1 gpus · Issue #11446 · huggingface/transformers · GitHub

Thank you.

qqhann · April 28, 2021, 8:53am

Thanks for replying here and on GitHub! I’ll reply further on the issue.
