stas
May 10, 2021, 4:40pm
Here is a WIP PR that makes it work with deepspeed:
huggingface:master ← stas00:ds-inputs-dtype (opened 10:29PM - 07 May 21 UTC)
Addressing the need in https://github.com/huggingface/transformers/issues/11446, this PR works on making wav2vec2 run under deepspeed.
This PR:
* changes Trainer to automatically convert floating-point inputs to the model's dtype when they aren't int64. We didn't need this for NLP models, because the embedding layer took care of it, but wav2vec2-style models take float32 inputs directly. (DeepSpeed-only at the moment; `fp16_full_eval` potentially needs the same treatment.)
* multiple fixes to the `wav2vec2` model, because it does very non-standard things. In particular, its `weight_norm` is implemented in a very odd way that defeats DeepSpeed's automatic ease-of-use handling and requires multiple manual adjustments to do the right thing: `weight_norm` creates a param, then drops it, replacing it with 2 other params, and re-creates the original on every forward (in a pre-forward hook).
* moves `require_deepspeed` to `testing_utils.py`, since multiple test files use it
* adds a `dtype` accessor to the DeepSpeed config object
* adds 8 new tests, one for each setup
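The dtype conversion from the first bullet can be sketched like this (a simplified illustration, not the actual Trainer code; `prepare_inputs` is a hypothetical helper name):

```python
import torch

def prepare_inputs(inputs, dtype):
    # cast floating-point tensors (e.g. wav2vec2's raw audio) to the
    # model's dtype; leave integer tensors such as labels/input_ids alone
    return {
        k: v.to(dtype) if torch.is_floating_point(v) else v
        for k, v in inputs.items()
    }

batch = {
    "input_values": torch.randn(2, 16000),          # float32 audio
    "labels": torch.ones(2, 20, dtype=torch.long),  # int64 targets
}
fp16_batch = prepare_inputs(batch, torch.float16)
print(fp16_batch["input_values"].dtype)  # torch.float16
print(fp16_batch["labels"].dtype)        # torch.int64
```

For NLP models this is a no-op, since their inputs are int64 ids.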
Testing with `run_asr.py`:
### ZeRO-2
Everything works:
* [x] fp16 distributed zero2
* [x] fp16 non distributed zero2
* [x] fp32 distributed zero2
* [x] fp32 non distributed zero2
Important: for distributed use you must set:
```
"zero_optimization": {
    "find_unused_parameters": true
}
```
You can use `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json`, which already includes this adjustment.
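For reference, the relevant portion of a zero2 config with that adjustment might look like this (a sketch only; the shipped `ds_config_wav2vec2_zero2.json` contains more settings):

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "find_unused_parameters": true
  }
}
```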
### ZeRO-3
You can use `--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero3.json`.
This works:
* [x] fp32 non distributed zero3
* [x] fp32 distributed zero3
* [x] fp16 non distributed zero3
* [x] fp16 distributed zero3
### Possible PR spin-offs
It looks like plain PyTorch distributed doesn't work either: https://github.com/huggingface/transformers/issues/11452. This PR could be adapted to detect `dist` and do the same thing as the DeepSpeed branch, though a separate PR is probably best. (This concerns the LayerDrop behavior.)
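A sketch of what the analogous fix looks like with plain PyTorch DDP (my illustration, not the PR's code): because LayerDrop means some parameters receive no gradient on a given step, DDP needs `find_unused_parameters=True` to tolerate them.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process gloo group, just to make the sketch self-contained;
# a real run would be launched with torchrun across multiple ranks
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 4)
# the key flag: allow params that get no gradient on some steps
ddp_model = DDP(model, find_unused_parameters=True)
out = ddp_model(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])

dist.destroy_process_group()
```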
---------------------
To run tests:
Install DeepSpeed from its master branch (https://github.com/microsoft/DeepSpeed):
```
pip install git+https://github.com/microsoft/DeepSpeed
```
and then:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 RUN_SLOW=1 pytest examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py
```
---------------------
Example of usage, assuming you are in the top dir of a git clone of this branch:
### `run_asr.py` and tiny model and tiny dataset
This is the foundation for the new tests:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH=src deepspeed --num_gpus 2 \
examples/research_projects/wav2vec2/run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --evaluation_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=patrickvonplaten/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```
### run_common_voice.py
This one is very hard to test with, as it takes some 5-10 min just to get ready to run.
**edit**: switch to the `datasets` master branch and add `HF_DATASETS_IN_MEMORY_MAX_SIZE=0` to the command line; the dataset will now be cached.
`run_common_voice.py` now runs under `--fp16` but gives `loss=nan`, probably the same issue as with bf16-pretrained models. I tested: it has the same issue under AMP with no deepspeed, so it's a different problem to solve.
fp32 works just fine loss-wise; you can try:
```
HF_DATASETS_IN_MEMORY_MAX_SIZE=0 PYTHONPATH="src" deepspeed --num_gpus=1 \
examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --group_by_length --do_train --do_eval --deepspeed \
examples/research_projects/wav2vec2/ds_config_wav2vec2_zero2.json
```
Thanks to @patrickvonplaten for making the small wav2vec2 models, which helped a ton to debug faster and were needed for the tests.
## Requirements to merge this PR
- [x] https://github.com/microsoft/DeepSpeed/pull/1135
- [x] deepspeed version requirement bumped to 0.4.0
- [x] deepspeed 0.4.0 released
Fixes: https://github.com/huggingface/transformers/issues/11446
wav2vec2 has 2 peculiarities:

1. It randomly skips layers (LayerDrop)! I think this is what requires `find_unused_parameters` in normal distributed runs and also in zero-2. For zero-3 we must run all GPUs in sync, so this problem is removed (see the PR).
2. It uses `weight_norm`, which re-creates 2 params in a pre-forward hook and has all kinds of potential side-effects. I am attempting to write a fused version of `weight_norm` + `Conv1d` which doesn't use any tricks, but I haven't fully sorted it out yet.
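For comparison, this is what stock `nn.utils.weight_norm` does to a `Conv1d` (my illustration, separate from the fused WIP below): it removes the `weight` parameter and registers `weight_g` / `weight_v`, recomputing `weight` in a pre-forward hook on every call.

```python
import torch
import torch.nn as nn

conv = nn.utils.weight_norm(nn.Conv1d(4, 4, kernel_size=3), name="weight")

# "weight" is no longer a parameter; it was split into weight_g / weight_v
names = sorted(n for n, _ in conv.named_parameters())
print(names)  # ['bias', 'weight_g', 'weight_v']

# "weight" is recomputed from weight_g and weight_v before each forward
out = conv(torch.randn(1, 4, 10))
print(out.shape)  # torch.Size([1, 4, 8])
```

It is this param shuffling that DeepSpeed ZeRO-3 cannot handle automatically.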
here is a work in progress:

```
import torch.nn as nn
from torch.nn.parameter import Parameter
from torch import _weight_norm, norm_except_dim


class Conv1dWithWeightNorm(nn.Conv1d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dim = 2

        # under ZeRO-3 the weight is partitioned, so gather it first
        import deepspeed
        with deepspeed.zero.GatheredParameters(self.weight):
            weight = self.weight
            self.weight_g = Parameter(norm_except_dim(weight, 2, self.dim).data)
            self.weight_v = Parameter(weight.data)
            # drop the original param and replace it with a computed tensor
            del self._parameters["weight"]
            self.weight = _weight_norm(self.weight_v, self.weight_g, self.dim)

    def compute_weight(self):
        # WIP: re-derives weight_g / weight_v on every call
        self.weight_g = Parameter(norm_except_dim(self.weight, 2, self.dim).data)
        self.weight_v = Parameter(self.weight.data)
        return _weight_norm(self.weight_v, self.weight_g, self.dim)

    def forward(self, input):
        self.weight = self.compute_weight()
        return self._conv_forward(input, self.weight, self.bias)


class Wav2Vec2PositionalConvEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.conv = Conv1dWithWeightNorm(
            in_channels=config.hidden_size,
            out_channels=config.hidden_size,
            kernel_size=config.num_conv_pos_embeddings,
            padding=config.num_conv_pos_embeddings // 2,
            groups=config.num_conv_pos_embedding_groups,
        )
        self.padding = Wav2Vec2SamePadLayer(config.num_conv_pos_embeddings)
        self.activation = ACT2FN[config.feat_extract_activation]

    def forward(self, hidden_states):
        hidden_states = hidden_states.transpose(1, 2)
        hidden_states = self.conv(hidden_states)
        hidden_states = self.padding(hidden_states)
        hidden_states = self.activation(hidden_states)
        hidden_states = hidden_states.transpose(1, 2)
        return hidden_states
```