Wav2vec2 not converging when finetuning

Hi @patrickvonplaten,

Thanks for offering to help me! :smiley:

For anyone else reading, this is a continuation of the discussion on huggingface/transformers issue 12137. In short, despite following the official wav2vec2 English ASR guide, my model does not converge on my dataset of 1-second audio clips, each containing a single English word.

When trying out the guide's Colab notebook on the TIMIT dataset, however, the model converges just fine, which is perplexing to me.

Here is my training notebook.

You probably need to train a bit longer. Did you train the full 30 epochs? Or did you stop at 2?
Pretraining only teaches wav2vec2 representations of speech; the model does not yet know how to output characters. The finetuning stage learns to use those embeddings to output characters.

The usual finetuning training behavior looks something like:

  • Beginning: Output random chars
  • Early: Output nothing - empty strings - looks like you are here?
  • After a while: Starts to spit out more relevant chars
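
To illustrate the "empty strings" stage: assuming the model is finetuned with a CTC head, as in the guide, decoding collapses repeated frame predictions and then drops the blank token. If the model predicts blank on every frame, the decoded text is empty. Here is a minimal sketch in plain Python; the toy vocabulary and function name are made up for illustration and are not the guide's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token
# and "|" stands in for a word delimiter.
VOCAB = ["<pad>", "|", "c", "a", "t"]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Greedy CTC decoding: collapse repeated ids, then drop blanks."""
    # Step 1: merge runs of identical consecutive predictions.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token and map ids to characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace("|", " ").strip()


# A model stuck in the "early" stage predicts blank for every frame.
all_blank = ctc_greedy_decode([0, 0, 0, 0, 0])
# A model further along emits characters separated by blanks.
a_word = ctc_greedy_decode([2, 2, 0, 3, 3, 0, 4])
```

So empty predictions do not necessarily mean the pipeline is broken; they can simply mean the blank token still wins on every frame.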

I stopped at 2 epochs. Ok, I'll try training for the full 30, will report on how that goes!

Hi @tadf,

I trained for 50 epochs with a learning rate of 1e-5. The learning rate scheduler was left at its default (same as the article). The model does start out by predicting random chars and then goes to empty strings, but even after 50 epochs it produces no relevant chars. You can see my output in the file "50epochs_output.pdf" in this folder.

Does this mean even more training time is needed?

Thanks!

I listened to your audio - the quality is really really bad. It's no surprise the model isn't able to predict on it.
Wav2vec2 is trained on 16 kHz audio. What is the sample rate for your data?
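
If the clips are plain PCM WAV files, a quick way to check is the standard library `wave` module. This is only a sketch: the helper names are made up here, and it assumes uncompressed WAV input (other formats would need a different reader).

```python
import wave

# Rate the model expects, per the discussion above.
EXPECTED_RATE = 16_000


def wav_sample_rate(source):
    """Return the sample rate in Hz of a PCM WAV file.

    `source` may be a file path or a binary file object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate=EXPECTED_RATE):
    """True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(source) != expected_rate
```

Any clip for which `needs_resampling` returns True would have to be resampled to 16 kHz before being fed to the model.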

The audio is also sampled at 16 kHz. Yeah, the noise and distortion of the audio is intentional - that's why the dataset is part of a hackathon haha :smiley:

Oh I see.
I think wav2vec2 is having a hard time because the audio is so different from the audio it was pretrained on.

I guess an ASR model is not suitable in this case then, so I will have to go back to audio classification models. Thanks for your help!