For anyone else reading, this is a continuation of the discussion in huggingface/transformers issue 12137. In short, despite following the official wav2vec2 English ASR guide, my model fails to converge on my dataset of single-word English audio clips, each about 1 second long.
When trying out the guide's Colab notebook on the TIMIT dataset, however, the model converges just fine, which is perplexing to me.
You probably need to train a bit longer. Did you train the full 30 epochs? Or did you stop at 2?
The wav2vec2 embeddings only learn representations of speech; they do not know how to output characters. The fine-tuning stage learns to use those embeddings to output characters.
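For a bit of intuition, here is a rough sketch (not your exact setup; checkpoint names follow the blog post, and the guide actually builds the processor from your own vocab.json) of what the fine-tuning stage adds: loading the pretrained base checkpoint into `Wav2Vec2ForCTC` attaches a freshly initialized CTC head, and that head is the part that has to learn to emit characters.

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Reusing the 960h processor here just for brevity of the example.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# facebook/wav2vec2-base only contains the self-supervised encoder; the CTC
# head (lm_head) is randomly initialized at this point and only learns to
# output characters during fine-tuning - hence the random/empty predictions
# early in training.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_extractor()  # the guide freezes the CNN feature extractor
```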
The usual fine-tuning training behavior looks something like this:
Beginning: Output random chars
Early: Output nothing - empty strings - looks like you are here?
After a while: Starts to spit out more relevant chars
I trained for 50 epochs with a learning rate of 1e-5. The learning rate scheduler was left at the default (the same as in the article). The model does start out predicting random chars and then moves on to empty strings, but even after 50 epochs no relevant chars ever appear. You can see my output in the file '50epochs_output.pdf' in this folder.
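For reference, my training arguments look roughly like this. Apart from the epoch count and learning rate, the values below are only illustrative placeholders, and the output path is made up:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-single-word",  # placeholder path
    num_train_epochs=50,                  # trained the full 50 epochs
    learning_rate=1e-5,
    lr_scheduler_type="linear",           # default scheduler, as in the guide
    per_device_train_batch_size=16,       # placeholder, whatever fits the GPU
    evaluation_strategy="steps",
    save_total_limit=2,
)
```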
I listened to your audio - the quality is really, really bad. It's no surprise the model isn't able to predict on it.
Wav2vec2 is pretrained on 16 kHz audio. What is the sample rate of your data?
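If it isn't 16 kHz, it's worth resampling the clips before feeding them to the model. A minimal sketch using torchaudio (file paths are placeholders):

```python
import torchaudio

# Load a clip and check its native sample rate
waveform, sample_rate = torchaudio.load("clip.wav")
print(sample_rate)  # wav2vec2 expects 16000

# Resample to 16 kHz if needed
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    torchaudio.save("clip_16k.wav", resampler(waveform), 16_000)
```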