Why is Wav2Vec pretraining loss not decreasing?

Hi there everyone :hugs:

I’m currently trying to pre-train a Wav2Vec2 base model. During the pre-training phase, the loss starts off around 4, decreases, then shoots up to 6.658 and stays there. The accuracy is also low and does not increase. My learning rate is set to 0.005; I started with 0.0001 and increased it gradually when I saw these results. I use the English Wav2Vec2 model for weight initialisation. I thought the loss would improve if I waited longer, but it stays the same even after 20 epochs. Can anyone share some advice on how to avoid this and improve the training?
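Not a full answer, but a data point worth checking: wav2vec 2.0 pretraining is known to be sensitive to the peak learning rate (the original BASE recipe uses a peak around 5e-4 with a long linear warmup, roughly 10× lower than 0.005), and raising the rate when the contrastive loss plateaus tends to make the collapse worse rather than recover it. A pure-Python sketch of the warmup-then-linear-decay schedule the example scripts use by default (function name and step counts are illustrative):

```python
def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# with BASE-like settings (peak 5e-4, 32k warmup steps, 400k total steps):
print(lr_at_step(16_000, 5e-4, 32_000, 400_000))  # halfway through warmup -> 2.5e-04
print(lr_at_step(32_000, 5e-4, 32_000, 400_000))  # peak -> 5e-04
print(lr_at_step(400_000, 5e-4, 32_000, 400_000)) # end of schedule -> 0.0
```

The point of the warmup is exactly to avoid the early-training instability described above, so it is usually safer to lower the peak rate or lengthen the warmup than to raise the rate.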

Your assistance will be much appreciated! :hugs:


Hi ZaNi,
Have you found the reason for this behaviour? I’m facing the same problem pretraining my model from the English base model.

@ZaNi @KhusainovAidar Hi, I have the same problem. Is there any solution?

Hi @patrickvonplaten

Thank you for your time!
I followed this pretraining script, but the loss is not decreasing, like this:

Could you give us some advice? Thank you!


@AndySun I haven’t found it. The strange thing is that after restarting training from a checkpoint (for which the loss was already near zero), it shows more realistic loss values for the first several steps and then goes back to 0.0003. So for now I’ve just trained wav2vec2 with fairseq and converted the resulting model to TorchScript.
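For anyone trying the same workaround: exporting a PyTorch model to TorchScript boils down to `torch.jit.script` (or `torch.jit.trace`) plus `save`. A minimal sketch with a dummy module standing in for the real wav2vec2 model (the actual fairseq model needs the wrapper from the torchaudio example linked below; shapes and names here are illustrative):

```python
import os
import tempfile

import torch


class DummyEncoder(torch.nn.Module):
    """Stand-in for the real wav2vec2 encoder; shapes are illustrative."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 4)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.proj(waveform)


model = DummyEncoder().eval()
scripted = torch.jit.script(model)  # or torch.jit.trace(model, example_input)

# serialized TorchScript; this file is what the C++/server side loads
path = os.path.join(tempfile.gettempdir(), "encoder_sketch.pt")
scripted.save(path)
restored = torch.jit.load(path)

x = torch.randn(1, 16)
assert torch.allclose(model(x), restored(x))
```

The scripted file can then be loaded from C++ via `torch::jit::load`, which is what makes the ‘server’-mode deployment below work without a Python runtime.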

@KhusainovAidar Okay. Thanks for your information!

@KhusainovAidar Hi, could you tell us how to convert a fairseq pre-trained model to the Transformers wav2vec2 format?

Thank you!

There is no such option, or at least I haven’t found one. I’m using this repo to convert the fairseq model to TorchScript and run it in ‘server’ mode: audio/examples/libtorchaudio/speech_recognition at master · pytorch/audio · GitHub

Okay! The TorchScript model can be used in an ASR product. Thanks a lot!

Hi, @KhusainovAidar

I found one way to convert a fairseq pre-trained model to the Transformers format, for your reference:

python -m transformers.models.wav2vec2.convert_wav2vec2_original_pytorch_checkpoint_to_pytorch --pytorch_dump_folder_path ./converted_model/ --checkpoint_path /path/to/**.pt --not_finetuned


I’m also having this issue. Has anybody figured out the cause? I’m using the following hyperparameters:

python ./code/pretrain.py \
	--dataset_name="./code/VoicesOfColor" \
	--dataset_config_names="train" \
	--dataset_split_names="TEST" \
	--output_dir="./code/wav2vec2-pretrained-VOC/artefacts" \
	--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
	--max_train_steps=600000 \
	--num_warmup_steps=32000 \
	--gradient_accumulation_steps=4 \
	--learning_rate=0.001 \
	--weight_decay=0.01 \
	--max_duration_in_seconds=30.0 \
	--min_duration_in_seconds=2.0 \
	--logging_steps=1 \
	--saving_steps=10000 \
	--per_device_train_batch_size=8 \
	--per_device_eval_batch_size=8 \
	--adam_beta1=0.9 \
	--adam_beta2=0.98 \
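One thing worth double-checking with these flags is the effective batch size: wav2vec 2.0 pretraining was tuned for large aggregate batches across many GPUs, and on a single device these settings give only 8 × 4 = 32 utterances per optimizer step. A quick pure-Python sanity check (the `num_gpus` value is an assumption; adjust it to your setup):

```python
# values taken from the command-line flags above
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single GPU

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# upper bound on audio per optimizer step, given --max_duration_in_seconds=30.0
max_audio_per_update_s = effective_batch * 30.0

print(effective_batch)         # 32 utterances per update
print(max_audio_per_update_s)  # 960.0 seconds, i.e. ~0.27 h per update
```

If the effective batch is much smaller than what the recipe assumes, the contrastive loss sees far fewer negatives per update, which can contribute to the kind of instability reported above; increasing `--gradient_accumulation_steps` is the cheapest way to compensate.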