@Maimonator Can you tell me how to set the `sharded_ddp` parameter? I tried to use DeepSpeed, but I ran into an error, the same one described in the issue below (RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one...). Could you share your DeepSpeed config JSON?
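For reference, a minimal DeepSpeed ZeRO stage-2 config might look like the sketch below. This is an illustrative assumption, not the config actually used in this thread; the file name `ds_config.json` and every value in it are placeholders to adjust.

```python
# A minimal sketch, assuming DeepSpeed ZeRO stage 2 -- values are illustrative
# and should be tuned to the actual run, not copied verbatim.
import json

ds_config = {
    "fp16": {"enabled": True},             # mirrors the --fp16 training flag
    "zero_optimization": {
        "stage": 2,                        # shard optimizer state and gradients
        "overlap_comm": True,              # overlap communication with the backward pass
        "reduce_scatter": True,
        "allgather_partitions": True,
    },
    "train_micro_batch_size_per_gpu": 16,  # keep in sync with --per_device_train_batch_size
    "gradient_accumulation_steps": 1,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The file is then passed through the Trainer's DeepSpeed integration, e.g.:
#   deepspeed run_common_voice.py --deepspeed ds_config.json <other args>
# Sharded DDP (fairscale) is a separate option, enabled with
#   --sharded_ddp simple   (or TrainingArguments(sharded_ddp="simple"))
```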
## Environment info
- `transformers` version: 4.5.0.dev0 (also tried 4.4.0; same error)
- Platform: Ubuntu (running on a virtual machine)
- Python version: 3.8
- PyTorch version (GPU?): 1.6.0
- Using GPU in script?: yes, running [this script](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py)
- Using distributed or parallel set-up in script?: Distributed
### Who can help
@patrickvonplaten (as per the message in the Slack group)
## Information
Model I am using (Bert, XLNet ...): Wav2Vec2 (`facebook/wav2vec2-large-xlsr-53`)
The problem arises when using:
- [x] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
I tried both the official command and a modified script (the run command changes based on the language).
The task I am working on is:
- [x] the Common Voice dataset (ta)
## To reproduce
Steps to reproduce the behavior:
1. Run the Common Voice script [from here](https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py)
2. For the multi-GPU setup I used this command (`--dataset_config_name` specifies the language code):

   ```bash
   python -m torch.distributed.launch \
       --nproc_per_node 4 run_common_voice.py \
       --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
       --dataset_config_name="tr" \
       --output_dir=./wav2vec2-large-xlsr-turkish-demo \
       --overwrite_output_dir \
       --num_train_epochs="5" \
       --per_device_train_batch_size="16" \
       --learning_rate="3e-4" \
       --warmup_steps="500" \
       --evaluation_strategy="steps" \
       --save_steps="400" \
       --eval_steps="400" \
       --logging_steps="400" \
       --save_total_limit="3" \
       --freeze_feature_extractor \
       --feat_proj_dropout="0.0" \
       --layerdrop="0.1" \
       --gradient_checkpointing \
       --fp16 \
       --group_by_length \
       --do_train --do_eval
   ```
## Error
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument 'find_unused_parameters=True' to 'torch.nn.parallel.DistributedDataParallel'; (2) making sure all 'forward' function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's 'forward' function. Please include the loss function and the structure of the return value of 'forward' of your module when reporting this issue (e.g. list, dict, iterable).
```
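If it helps while debugging: the Trainer exposes DDP's `find_unused_parameters` flag through `TrainingArguments.ddp_find_unused_parameters`. The sketch below is a possible starting point, not a confirmed fix; note that with `--layerdrop 0.1` wav2vec2 randomly skips transformer layers, so some parameters genuinely receive no gradient on a given step, which is what DDP is complaining about here.

```python
# A minimal sketch: forwarding find_unused_parameters to DistributedDataParallel
# via the Trainer. ddp_find_unused_parameters is a real TrainingArguments field;
# the other arguments are just placeholders for the actual run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-turkish-demo",
    per_device_train_batch_size=16,
    fp16=True,
    ddp_find_unused_parameters=True,  # lets DDP tolerate parameters skipped by layerdrop
)
```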
## Expected behavior
The model should train without errors.