Wav2vec fine-tuning with multiGPU

gorodecki · March 22, 2021, 1:50pm

Hi, @patrickvonplaten @valhalla
I’m fine-tuning a wav2vec model following “Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers” on a local machine with 4×T4 GPUs (16 GB each).
I have some problems with training.

Training is very slow.
[Screenshot: use_one_gpu]

[Screenshot: use_3_gpu]

Why has training slowed down so much?

gorodecki · March 22, 2021, 2:26pm

I noticed a strange memory allocation on the GPU

training_args = TrainingArguments(
    output_dir="./rus_model",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    save_steps=400,
    eval_steps=200,
    logging_steps=100,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    dataloader_num_workers=16,
    report_to='tensorboard'
)

patrickvonplaten · March 23, 2021, 3:58pm

Related issue: Excessive GPU-GPU communication with GPT2 making multi-GPU training slow? · Issue #9371 · huggingface/transformers · GitHub

patrickvonplaten · March 23, 2021, 4:03pm

The GPU memory allocation is not unreasonable: with this kind of multi-GPU training (DataParallel), one of the four GPUs has to store all of the optimizer state, since the gradient updates are always done on only one of the four GPUs.

I have never seen such a heavy slowdown when using multiple GPUs. Note, though, that if you leave per_device_train_batch_size=16 when doing multi-GPU training, you increase your effective batch size by a factor of 4 in your setup. But I’m not really sure what’s going on here… What happens if you reduce your batch size?
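
To make the batch-size remark concrete, here is a minimal sketch of the arithmetic. The helper name `effective_batch_size` is made up for illustration; the numbers come from the TrainingArguments posted above.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * num_gpus * grad_accum_steps


# per_device_train_batch_size=16 and gradient_accumulation_steps=2, as posted above.
one_gpu = effective_batch_size(16, num_gpus=1, grad_accum_steps=2)
four_gpus = effective_batch_size(16, num_gpus=4, grad_accum_steps=2)
print(one_gpu, four_gpus)  # 32 128
```

So with unchanged arguments, each optimizer step on 4 GPUs processes four times as many samples as on one GPU.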

Finally, PyTorch recommends doing distributed training instead of “multi-GPU” training. Here is an example of how to do distributed training: transformers/examples at master · huggingface/transformers · GitHub
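
As a rough sketch of what launching distributed training looks like, the snippet below only assembles the command line for the `torch.distributed.launch` launcher (the same launcher that appears in the issue quoted later in this thread). The helper `build_launch_command` and the chosen script arguments are illustrative assumptions, not part of any library.

```python
import shlex
import sys


def build_launch_command(script, nproc_per_node, script_args):
    """Assemble an argv list that starts one training process per GPU."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc_per_node),
        script, *script_args,
    ]


# Illustrative arguments only; use the ones your training script expects.
cmd = build_launch_command(
    "run_common_voice.py",
    nproc_per_node=4,
    script_args=["--per_device_train_batch_size=16", "--fp16", "--do_train"],
)
print(shlex.join(cmd))
# To actually start training: subprocess.run(cmd, check=True)
```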

patrickvonplaten · March 23, 2021, 4:05pm

Also gently pinging @sgugger in case my answers are inaccurate or you see a bug that I’m not seeing.

sgugger · March 23, 2021, 4:58pm

I haven’t studied DataParallel enough to know why this is happening or whether your explanations are inaccurate. I only use the distributed one, since it is what PyTorch recommends.

Srulikbdd · March 24, 2021, 3:24pm

It helped me to use fewer GPUs, selected via
export CUDA_VISIBLE_DEVICES=1,2,3
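
The same restriction can be applied from inside a Python script or notebook. This is a minimal sketch; `restrict_visible_gpus` is a made-up helper name, and it has to run before CUDA is initialised (in practice, before the first `import torch`).

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to CUDA in this process."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Equivalent to `export CUDA_VISIBLE_DEVICES=1,2,3` for this process only.
restrict_visible_gpus([1, 2, 3])
```

Inside the process the three remaining GPUs are renumbered as devices 0, 1 and 2.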

gorodecki · March 24, 2021, 9:21pm

Thank you!
I started run_common_voice.py with torch.distributed.launch, and all 4 GPUs are now loaded normally.
I had to set per_device_train_batch_size="3", otherwise the system gave a CUDA out-of-memory error. Why? What speed-up do I get from using multiple GPUs this way if the total batch size is only 12? Where am I going wrong?

gorodecki · March 26, 2021, 11:37am

When I train the model I tuned in Jupyter on multiple GPUs with the script run_common_voice.py, I see this error:

RuntimeError: Expected to mark a variable ready only once.
This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

Traceback (most recent call last):
  File "run_common_voice_rus.py", line 434, in <module>
    main()
  File "run_common_voice_rus.py", line 402, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/lib/python3.8/site-packages/transformers/trainer.py", line 1056, in train
    tr_loss += self.training_step(model, inputs)
  File "run_common_voice_rus.py", line 247, in training_step
    self.scaler.scale(loss).backward()
  File "/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args) # type: ignore
  File "lib/python3.8/site-packages/torch/utils/checkpoint.py", line 112, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
@sgugger
What is the reason for this error? Why does it appear with the pretrained model?

gorodecki · March 31, 2021, 4:33pm

@patrickvonplaten @sgugger Can you help in solving this issue?

Maimonator · April 11, 2021, 10:15am

Don’t know if this is still relevant, but I’ve had a similar issue using multi-GPU, so after a lot of googling I found these:

Using deepspeed helped me accelerate my training procedure. Also make sure you use sharded_ddp so that memory is evenly distributed among the GPUs.

Hopefully that helps

tommy19970714 · April 14, 2021, 1:41pm

@Maimonator Can you tell me how to set the sharded_ddp parameter?

I have tried to use deepspeed, but I got an error.
It is the same error as in the following issue (RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one...).

Can you share your deepspeed JSON config?

github.com/huggingface/transformers

RuntimeError: while running run_common_voice.py (XLSR wav2vec finetuning week)

opened 04:08AM - 24 Mar 21 UTC

closed 03:02PM - 01 May 21 UTC

raja1196

## Environment info

- `transformers` version: 4.5.0.dev0 (I tried running it on 4.4.0 as well, gave the same error)
- Platform: Ubuntu (running on a virtual machine)
- Python version: 3.8
- PyTorch version (GPU?): 1.6.0
- Using GPU in script?: yes, running [this script](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py)
- Using distributed or parallel set-up in script?: Distributed

### Who can help

@patrickvonplaten (as per the message on slack group)

## Information

Model I am using (Bert, XLNet ...):

The problem arises when using:
- [ ] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)

Tried running both official command and modified script (running command changed based on the language)

The tasks I am working on is
- [ ] common voice dataset (ta)

## To reproduce

Steps to reproduce the behavior:
1. run common voice script [from here](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py)
2. For multi-gpu setup I used this command

`python -m torch.distributed.launch \
    --nproc_per_node 4 run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \ # use this argument to specify the language code
    --output_dir=./wav2vec2-large-xlsr-turkish-demo \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --save_steps="400" \
    --eval_steps="400" \
    --logging_steps="400" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --fp16 \
    --group_by_length \
    --do_train --do_eval `

## Error:

`RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument 'find_unused_parameters=True' to 'torch.nn.parallel.DistributedDataParallel'; (2) making sure all 'forward' function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's 'forward' function. Please include the loss function and the structure of the return value of 'forward' of your module when reporting this issue (e.g. list, dict, iterable).`

## Expected behavior

Model would train without any error

Maimonator · April 14, 2021, 4:03pm

@tommy19970714
To be honest, I’ve found success using the default parameters for my use case, so I never got the chance to dive deeper into the deepspeed configuration, but I found this, which might be helpful to you.

github.com/huggingface/transformers

OOM when trying to fine tune patrickvonplaten/led-large-16384-pubmed

opened 06:28PM - 04 Feb 21 UTC

closed 03:03PM - 23 Apr 21 UTC

mmoya01

DeepSpeed

I'm currently following this [notebook](https://colab.research.google.com/drive/…12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing#scrollTo=tLM3niQqhEzP) but instead I'm using `patrickvonplaten/led-large-16384-pubmed`

```python
tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed",)
led = AutoModelForSeq2SeqLM.from_pretrained(
    "patrickvonplaten/led-large-16384-pubmed",
    gradient_checkpointing=True,
    use_cache=False,
)
```

instead of `allenai/led-large-16384` as the base model and tokenizer. I'm also using my own train/test data. With the exception of that, I kept everything else the same/consistent to that notebook as far as fine tuning. However, I'm running into OOM errors

```
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 13.96 GiB already allocated; 20.00 MiB free; 14.56 GiB reserved in total by PyTorch)
0%| | 0/3 [00:10<?, ?it/s]
```

on a couple of `Tesla V100-SXM2-16GB` and I'm not sure why that might be. The `batch_size=2` seems pretty small and I also set `gradient_checkpoint=True`. @patrickvonplaten and/or the surrounding community, I'd greatly appreciate any help with this

tommy19970714 · April 14, 2021, 4:21pm

@Maimonator Thank you for your reply! The issue you shared was very helpful.

Did you run your experiment without the deepspeed config?

Then I experimented, and the following config worked (in this case, I used the huggingface optimizer).

{
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  }
}

However, the next config gave me the following error (in this case, I used the deepspeed optimizer).
Did you use the huggingface optimizer or the deepspeed optimizer?

{
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": true,
    "cpu_offload_params": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-6
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 50
    }
  }
}

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)
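
The difference between the two configs above, as described in this post, is whether an "optimizer" section is present. A minimal sketch of that check follows. `optimizer_source` is a hypothetical helper, and the rule it encodes (no "optimizer" section means the Trainer's own optimizer is used) is taken from the description in this post rather than verified against a specific library version.

```python
import json


def optimizer_source(ds_config):
    """Return which side supplies the optimizer for a given DeepSpeed config dict."""
    return "deepspeed" if "optimizer" in ds_config else "trainer"


# The first config from this post, and a variant with an optimizer section added.
first = json.loads('{"fp16": {"enabled": true, "min_loss_scale": 1, "opt_level": "O3"}}')
second = dict(first, optimizer={"type": "AdamW", "params": {"lr": 0.001}})

print(optimizer_source(first), optimizer_source(second))
```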

liaspas · April 17, 2021, 1:19pm

Also facing issues with multiGPU training.
Trying to finetune with DistributedDataParallel gives me the following error.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

If I set find_unused_parameters=True then I get the same error as @gorodecki

RuntimeError: Expected to mark a variable ready only once.
This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

All is fine with one GPU however.

stas · May 10, 2021, 4:40pm

Here is a WIP PR that makes it work with deepspeed:

github.com/huggingface/transformers

[Deepspeed Wav2vec2] integration

huggingface:master ← stas00:ds-inputs-dtype

opened 10:29PM - 07 May 21 UTC

stas00

+496 -64

Addressing the need in https://github.com/huggingface/transformers/issues/11446,… this PR is working on making wav2vec2 work under deepspeed.

This PR:
* changes Trainer to automatically convert inputs to the correct dtype if it's not int64 - we didn't need this for nlp models because embeddings took care of this - this is not the case with wav2vec2 type of models where inputs are float32 by default. (for deepspeed only at the moment - potentially need to do the same for `fp16_full_eval`)
* multiple fixes to the `wav2vec2` model, because it does very non-standard things, like model `weight_norm` which is implemented in a very odd way and deepspeed's automatic ease-of-use fails to do that and requires multiple manual adjustments for it to do the right thing. `weight_norm` creates a param, then drops it replacing it with 2 other params and re-creates them on every forward. (in pre-hook).
* moves `require_deepspeed` to `testing_utils.py` as we have multiple test files using it
* adds `dtype` accessor to DS conf object
* adds 8 new tests, checking each setup

Testing with `run_asr.py`:

### ZeRO-2

Everything works:
* [x] fp16 distributed zero2
* [x] fp16 non distributed zero2
* [x] fp32 distributed zero2
* [x] fp32 non distributed zero2

important - must use for distributed use:
```
"zero_optimization": {
    "find_unused_parameters": true,
```
So you can use the `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json` which already has the adjustment.

### ZeRO-3

You can use the `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero3.json`

This works:
* [x] fp32 non distributed zero3
* [x] fp32 distributed zero3
* [x] fp16 non distributed zero3
* [x] fp16 distributed zero3

### Possible PR spin-offs

it looks like plain pytorch dist doesn't work either https://github.com/huggingface/transformers/issues/11452 so this PR can be adapted to detect `dist` and do the same as what deepspeed branch does. probably a separate PR is the best. (LayerSkip that is)

---------------------

To run tests:

Install this deepspeed master https://github.com/microsoft/DeepSpeed:
```
pip install deepspeed
```
and then:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 RUN_SLOW=1 pyt examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py
```

---------------------

Example of usage: assuming you in a top dir of the git clone of this branch

### `run_asr.py` and tiny model and tiny dataset

This is the foundation for the new tests:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH=src deepspeed --num_gpus 2 \
examples/research_projects/wav2vec2/run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --evaluation_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=patrickvonplaten/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```

### run_common_voice.py

very hard to test with as it takes some 5-10mins to just get ready to run.

**edit**: switch to `datasets` master branch and add `HF_DATASETS_IN_MEMORY_MAX_SIZE=0` to the command line - it will be cached now.

`run_common_voice.py` now runs under `--fp16` but gives `loss=nan`, probably the same issue as bf16-pretrained models? I tested - it has the same issue under AMP and no deepspeed. So it's a different problem to solve.

fp32 works just fine loss-wise, you can try:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH="src" deepspeed --num_gpus=1 \
examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --group_by_length --do_train --do_eval --deepspeed \
examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```

Thanks to @patrickvonplaten for making small wav2vec2 models which helped a ton to debug faster and they were needed for the tests.

## Requirements to merge this PR

- [x] https://github.com/microsoft/DeepSpeed/pull/1135
- [x] deepspeed version requirement bumped to 0.4.0
- [x] deepspeed 0.4.0 released

Fixes: https://github.com/huggingface/transformers/issues/11446
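
The PR description above says distributed ZeRO-2 runs need `find_unused_parameters` inside `zero_optimization`. Below is a minimal sketch of generating such a config from Python. Only that key placement comes from the PR text; the other values are assumptions, and the `ds_config_wav2vec2_zero2.json` file shipped with the PR may contain more settings.

```python
import json

# Minimal ZeRO-2 config sketch; everything except the placement of
# "find_unused_parameters" is an assumption for illustration.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "find_unused_parameters": True,
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
# Save config_text to a .json file and pass its path via --deepspeed.
```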

wav2vec2 has 2 peculiarities:

1. it randomly skips layers! which I think is what requires find_unused_parameters - in normal dist and also in zero-2. for zero-3 we must run all gpus in sync, so this problem is removed. (see PR)
2. it uses weight_norm which re-creates 2 params in pre-forward which also has all kinds of potential side-effects. I am attempting to write a fused version of weight_norm + Conv1d which doesn’t use any tricks, but I haven’t fully sorted it out yet.
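
The random layer skipping can be illustrated without any model at all. This is a toy simulation, not the wav2vec2 implementation: the layer count of 24 and the seed are assumptions, and layerdrop=0.1 is the value used in the commands quoted in this thread. In each step, the parameters of the skipped layers take no part in computing the loss, which is the situation `find_unused_parameters` is meant to handle.

```python
import random


def layers_run(num_layers, layerdrop, rng):
    """Indices of the layers that are NOT skipped in one forward pass."""
    return [i for i in range(num_layers) if rng.random() >= layerdrop]


num_layers, layerdrop = 24, 0.1  # assumed depth; layerdrop as in the commands above
rng = random.Random(0)           # arbitrary seed, only to make the demo repeatable
for step in range(3):
    used = layers_run(num_layers, layerdrop, rng)
    skipped = sorted(set(range(num_layers)) - set(used))
    print(f"step {step}: skipped layers {skipped}")
```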

Here is a work in progress:

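import torch.nn as nn
from torch.nn.parameter import Parameter
from torch import _weight_norm, norm_except_dim

# needed by Wav2Vec2PositionalConvEmbedding below when this snippet is used
# outside of modeling_wav2vec2.py
from transformers.activations import ACT2FN
from transformers.models.wav2vec2.modeling_wav2vec2 import Wav2Vec2SamePadLayer


class Conv1dWithWeightNorm(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super(Conv1dWithWeightNorm, self).__init__(*args, **kwargs)
        self.dim = 2
        import deepspeed

        # under ZeRO-3 the weight is partitioned across GPUs, so gather it first
        with deepspeed.zero.GatheredParameters(self.weight):
            weight = self.weight
        # split the weight into a magnitude (g) and a direction (v) parameter
        self.weight_g = Parameter(norm_except_dim(weight, 2, self.dim).data)
        self.weight_v = Parameter(weight.data)
        # "weight" stops being a parameter; it is recomputed from g and v
        del self._parameters["weight"]
        self.weight = _weight_norm(self.weight_v, self.weight_g, self.dim)
        print(self.weight)  # debug output

    def compute_weight(self):
        # WIP: g and v are re-derived from the current weight on every call
        self.weight_g = Parameter(norm_except_dim(self.weight, 2, self.dim).data)
        self.weight_v = Parameter(self.weight.data)
        return _weight_norm(self.weight_v, self.weight_g, self.dim)

    def forward(self, input):
        self.weight = self.compute_weight()
        return self._conv_forward(input, self.weight, self.bias)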

class Wav2Vec2PositionalConvEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.conv = Conv1dWithWeightNorm(
            in_channels=config.hidden_size,
            out_channels=config.hidden_size,
            kernel_size=config.num_conv_pos_embeddings,
            padding=config.num_conv_pos_embeddings // 2,
            groups=config.num_conv_pos_embedding_groups,
        )
        self.padding = Wav2Vec2SamePadLayer(config.num_conv_pos_embeddings)
        self.activation = ACT2FN[config.feat_extract_activation]

    def forward(self, hidden_states):
        hidden_states = hidden_states.transpose(1, 2)

        hidden_states = self.conv(hidden_states)
        hidden_states = self.padding(hidden_states)
        hidden_states = self.activation(hidden_states)

        hidden_states = hidden_states.transpose(1, 2)
        return hidden_states

DAlorlicorn · May 22, 2021, 11:47pm

While this is a late reply, I also ran into the same situation as @liaspas. The culprit is actually the gradient checkpointing in the encoder. Specifically, gradient checkpointing re-runs parts of the forward pass during the backward pass instead of caching their activations, in order to save GPU memory. As the error message suggests, this causes the PyTorch DDP algorithm to mark the affected parameters as ready twice. To resolve this, disable it manually by setting gradient_checkpointing to false in the encoder’s config. Since gradient checkpointing trades computation for memory, this will actually speed up training, provided you have enough VRAM. You also need find_unused_parameters=True, as not all parameters are used in the forward pass.
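
A minimal sketch of these two settings, under stated assumptions: `disable_gradient_checkpointing` is a made-up helper, a `SimpleNamespace` stands in for the real `model.config`, and `ddp_find_unused_parameters` is assumed to be the TrainingArguments field that forwards `find_unused_parameters` to DDP in your Transformers version (worth checking before relying on it).

```python
from types import SimpleNamespace


def disable_gradient_checkpointing(config):
    """Turn off gradient checkpointing on a model config object, in place."""
    config.gradient_checkpointing = False
    return config


# Stand-in for `model.config`; with a real model you would pass model.config.
config = disable_gradient_checkpointing(
    SimpleNamespace(gradient_checkpointing=True, layerdrop=0.1)
)

# Extra keyword argument one might pass to TrainingArguments (assumed field name).
training_kwargs = {"ddp_find_unused_parameters": True}
print(config.gradient_checkpointing, training_kwargs)
```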
