Wav2vec fine-tuning with multiGPU

Hi, @patrickvonplaten @valhalla
I’m fine-tuning a wav2vec model following "Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers" on a local machine with 4x T4 GPUs (16 GB each).
I have some problems with training.

  1. Training is very slow

[Screenshot: use_3_gpu]

Why has the learning process slowed down so much?

I noticed a strange memory allocation on the GPU

training_args = TrainingArguments(
    output_dir="./rus_model",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    save_steps=400,
    eval_steps=200,
    logging_steps=100,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    dataloader_num_workers=16,
    report_to='tensorboard'
)

Related issue: Excessive GPU-GPU communication with GPT2 making multi-GPU training slow? · Issue #9371 · huggingface/transformers · GitHub

The GPU memory allocation is not unreasonable: the gradient updates are always done on only one of the four GPUs, so that GPU has to store all of the optimizer state.

I have never seen such a heavy slowdown when using multiple GPUs. Note, though, that if you leave per_device_train_batch_size=16 when doing multi-GPU training, you increase your effective batch size by a factor of 4 in your setup. But I’m not really sure what’s going on here… what happens if you reduce your batch size?
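To make the batch-size arithmetic concrete, here is a minimal sketch. The helper name `effective_batch_size` is made up for illustration; the numbers come from the TrainingArguments posted above, and the sketch assumes every GPU processes `per_device_train_batch_size` samples per forward pass.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Number of samples that contribute to a single optimizer step."""
    return per_device_batch_size * num_gpus * grad_accum_steps


# Values taken from the TrainingArguments in the question.
single_gpu = effective_batch_size(16, num_gpus=1, grad_accum_steps=2)
four_gpus = effective_batch_size(16, num_gpus=4, grad_accum_steps=2)

print(single_gpu, four_gpus)  # 32 128
```

To keep the single-GPU effective batch size on 4 GPUs, you would divide either the per-device batch size or the accumulation steps accordingly.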

Finally, PyTorch recommends doing distributed training instead of “multi-GPU” training. Here is an example of how to do distributed training: transformers/examples at master · huggingface/transformers · GitHub
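As a rough sketch of what a distributed launch looks like, the snippet below only builds the command line for `torch.distributed.launch` with one process per GPU. The helper `build_launch_command` is hypothetical, and the script arguments (`--output_dir`, `--fp16`) are illustrative assumptions; check the script's own `--help` for what it accepts. Newer PyTorch releases also ship `torchrun` for the same purpose.

```python
import shlex
import sys


def build_launch_command(script, nproc_per_node, script_args=()):
    """Build a torch.distributed.launch command that starts one process per GPU."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]


# Illustrative arguments only; adapt them to your own training script.
cmd = build_launch_command(
    "run_common_voice.py", 4, ["--output_dir", "./rus_model", "--fp16"]
)
print(shlex.join(cmd))

# To actually start training you could run:
#   import subprocess; subprocess.run(cmd, check=True)
```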

Also gently pinging @sgugger in case my answers are inaccurate or you see a bug that I’m not seeing :slight_smile:

I haven’t studied DataParallel enough to know why this is happening or whether your explanations are inaccurate. I only use the distributed one since it is what PyTorch recommends.

It helped me to use fewer GPUs, selected via
export CUDA_VISIBLE_DEVICES=1,2,3
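The same restriction can be applied from inside Python, for example in a notebook. This is a minimal sketch; the one assumption to keep in mind is that the variable has to be set before CUDA is initialized, so the safest place is before `torch` is imported at all.

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch).
# Physical GPUs 1, 2 and 3 will then appear to the process as devices 0, 1 and 2.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # 3
```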

Thank you!
I started run_common_voice.py with torch.distributed.launch, and now all 4 GPUs are loaded normally.
I had to set per_device_train_batch_size="3"; otherwise the system gives a CUDA out_of_memory error. :scream: Why? What speedup do I actually get from multi-GPU this way, if the total batch size is only 12? :hugs: Where am I going wrong?

When I start multi-GPU training of the model I had tuned in Jupyter, using the run_common_voice.py script, I see this error:

RuntimeError: Expected to mark a variable ready only once.
This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

Traceback (most recent call last):
  File "run_common_voice_rus.py", line 434, in <module>
    main()
  File "run_common_voice_rus.py", line 402, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/lib/python3.8/site-packages/transformers/trainer.py", line 1056, in train
    tr_loss += self.training_step(model, inputs)
  File "run_common_voice_rus.py", line 247, in training_step
    self.scaler.scale(loss).backward()
  File "/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args) # type: ignore
  File "lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
@sgugger
What is the reason for this error? Why does it appear with the pretrained model?

@patrickvonplaten @sgugger Can you help in solving this issue? :pray: :eyes:

Don’t know if this is still relevant but I’ve had a similar issue using Multi-GPU so after a lot of googling I found these:

Using deepspeed helped me speed up my training, and also make sure you use sharded_ddp so that memory is evenly distributed among the GPUs.

Hopefully that helps

@Maimonator Can you tell me how to set the sharded_ddp parameter?

I tried to use deepspeed, but I ran into an error.
That error is the same as the following issue (RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one...).

Can you share your deepspeed JSON config?

@tommy19970714
To be honest I’ve found success using the default parameters on my use-case so I never got the chance to dive deeper into the deepspeed configuration, but I found this that might be helpful to you.

@Maimonator Thank you for your reply! The issue you shared was very helpful.

Did you run your experiment without the deepspeed config?

Then I experimented, and the following config worked (in this case, I used the huggingface optimizer).

{
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  }
}
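For reference, a config like the one above is usually handed to the Trainer as a JSON file. The sketch below only writes the poster's config to disk using the standard library; the file name `ds_config.json` and the temporary directory are arbitrary choices, and the commented-out line assumes the `deepspeed` argument of `TrainingArguments`, which takes a path to such a file.

```python
import json
import os
import tempfile

# The working config from the post above, as a Python dict.
ds_config = {
    "fp16": {
        "enabled": True,
        "min_loss_scale": 1,
        "opt_level": "O3",
    }
}

# Write it to a JSON file (location chosen arbitrarily for this sketch).
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

# Then, in the training script (requires transformers and deepspeed):
#   training_args = TrainingArguments(output_dir="./rus_model", deepspeed=config_path)
```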

However, the next config gave me the error shown below it (in this case, I used the deepspeed optimizer).
Did you use the huggingface optimizer or the deepspeed optimizer?

{
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": true,
    "cpu_offload_params": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-6
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 50
    }
  }
}

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)

Also facing issues with multiGPU training.
Trying to finetune with DistributedDataParallel gives me the following error.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

If I set find_unused_parameters=True, then I get the same error as @gorodecki:

RuntimeError: Expected to mark a variable ready only once.
This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

All is fine with one GPU however.

Here is a WIP PR that makes it work with deepspeed:

wav2vec2 has 2 peculiarities:

  1. It randomly skips layers, which I think is what requires find_unused_parameters, both in normal distributed training and in ZeRO-2. For ZeRO-3 we must run all GPUs in sync, so this problem is removed (see PR).
  2. It uses weight_norm, which re-creates 2 params in pre-forward, and that also has all kinds of potential side effects. I am attempting to write a fused version of weight_norm + Conv1d which doesn’t use any tricks, but I haven’t fully sorted it out yet.
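To illustrate the first point, here is a small pure-Python simulation of why random layer skipping leaves some parameters without gradients. It assumes LayerDrop-style behavior in which each encoder layer is skipped independently with probability `layerdrop`; the function name, the 24 layers, and the 0.1 probability are illustrative values, not read from any real config.

```python
import random


def layers_executed(num_layers, layerdrop, rng):
    """Return the indices of the layers that run in one training forward pass.

    A layer is skipped when the random draw falls below `layerdrop`.
    """
    return [i for i in range(num_layers) if rng.random() >= layerdrop]


num_layers, layerdrop = 24, 0.1  # illustrative values

# Two ranks with different random states may skip different layers.
rank0 = layers_executed(num_layers, layerdrop, random.Random(0))
rank1 = layers_executed(num_layers, layerdrop, random.Random(1))

skipped_rank0 = sorted(set(range(num_layers)) - set(rank0))
skipped_rank1 = sorted(set(range(num_layers)) - set(rank1))
print("rank 0 skipped:", skipped_rank0)
print("rank 1 skipped:", skipped_rank1)
```

Any layer skipped on a rank produces no gradient on that rank for that step, which is the situation find_unused_parameters is meant to handle.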

here is a work in progress:

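import torch.nn as nn
from torch.nn.parameter import Parameter
from torch import _weight_norm, norm_except_dim
# Helpers used by Wav2Vec2PositionalConvEmbedding below
from transformers.activations import ACT2FN
from transformers.models.wav2vec2.modeling_wav2vec2 import Wav2Vec2SamePadLayer
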
class Conv1dWithWeightNorm(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super(Conv1dWithWeightNorm, self).__init__(*args, **kwargs)
        self.dim = 2
        import deepspeed
        with deepspeed.zero.GatheredParameters(self.weight):
            weight = self.weight
        self.weight_g = Parameter(norm_except_dim(weight, 2, self.dim).data)
        self.weight_v = Parameter(weight.data)
        del self._parameters["weight"]
        self.weight = _weight_norm(self.weight_v, self.weight_g, self.dim)
        print(self.weight)

    def compute_weight(self):
        self.weight_g = Parameter(norm_except_dim(self.weight, 2, self.dim).data)
        self.weight_v = Parameter(self.weight.data)
        return _weight_norm(self.weight_v, self.weight_g, self.dim)

    def forward(self, input):
        self.weight = self.compute_weight()
        return self._conv_forward(input, self.weight, self.bias)

class Wav2Vec2PositionalConvEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.conv = Conv1dWithWeightNorm(
            in_channels=config.hidden_size,
            out_channels=config.hidden_size,
            kernel_size=config.num_conv_pos_embeddings,
            padding=config.num_conv_pos_embeddings // 2,
            groups=config.num_conv_pos_embedding_groups,
        )
        self.padding = Wav2Vec2SamePadLayer(config.num_conv_pos_embeddings)
        self.activation = ACT2FN[config.feat_extract_activation]

    def forward(self, hidden_states):
        hidden_states = hidden_states.transpose(1, 2)

        hidden_states = self.conv(hidden_states)
        hidden_states = self.padding(hidden_states)
        hidden_states = self.activation(hidden_states)

        hidden_states = hidden_states.transpose(1, 2)
        return hidden_states

Although this is a late reply, I also encountered the same situation as @liaspas. The culprit is actually gradient checkpointing in the encoder. To be specific, it reruns parts of the forward pass during the backward pass instead of caching their activations, in order to save GPU memory. As the error message suggests, this causes the torch DDP algorithm to mark the same parameters as ready twice. To resolve this, just disable it manually by setting gradient_checkpointing to false in the encoder config. Since gradient checkpointing trades computation for memory, disabling it will actually speed up training, provided you have enough VRAM. You also need find_unused_parameters=True, as not all parameters are used in the forward pass.
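The two settings described above could be applied roughly as follows. This is a sketch under assumptions, not a tested recipe: `ddp_find_unused_parameters` is the `TrainingArguments` option that is forwarded to DDP, the other values in the dict are copied from the original question, and the helper `load_model_without_checkpointing` is a made-up name. How gradient checkpointing is switched off depends on the installed transformers version, so the helper tries both the config attribute and the `gradient_checkpointing_disable()` method.

```python
# Keyword arguments intended for transformers.TrainingArguments.
ddp_training_kwargs = {
    "output_dir": "./rus_model",
    "per_device_train_batch_size": 16,
    "fp16": True,
    # Let DDP tolerate parameters that receive no gradient in a given step.
    "ddp_find_unused_parameters": True,
}


def load_model_without_checkpointing(model_name):
    """Load a Wav2Vec2 CTC model and make sure gradient checkpointing is off."""
    from transformers import Wav2Vec2ForCTC  # imported lazily; needs transformers

    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    # Older releases read the flag from the config.
    if hasattr(model.config, "gradient_checkpointing"):
        model.config.gradient_checkpointing = False
    # Newer releases expose an explicit method instead.
    if hasattr(model, "gradient_checkpointing_disable"):
        model.gradient_checkpointing_disable()
    return model


# Usage (requires transformers and a checkpoint name of your choice):
#   training_args = TrainingArguments(**ddp_training_kwargs)
#   model = load_model_without_checkpointing("<your checkpoint>")
```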