For anyone else reading, this is a continuation of the discussion in huggingface/transformers issue 12137. In short, despite following the official wav2vec2 English ASR guide, my model fails to converge on my dataset of single-word English audio clips, each about 1 second long.
When trying out the guide's Colab notebook on the TIMIT dataset, however, the model converges just fine, which is perplexing to me.
You probably need to train a bit longer. Did you train the full 30 epochs? Or did you stop at 2?
The wav2vec2 embeddings only learn representations of speech; they do not know how to output characters. The fine-tuning stage learns to use those embeddings to output characters.
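For a bit of intuition, here is a rough sketch (not your exact setup; checkpoint names follow the blog post, and the guide actually builds the processor from your own vocab.json) of what the fine-tuning stage adds: loading the pretrained base checkpoint into `Wav2Vec2ForCTC` attaches a freshly initialized CTC head, and that head is the part that has to learn to emit characters.

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Reusing the 960h processor here just for brevity of the example.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# facebook/wav2vec2-base only contains the self-supervised encoder; the CTC
# head (lm_head) is randomly initialized at this point and only learns to
# output characters during fine-tuning - hence the random/empty predictions
# early in training.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_extractor()  # the guide freezes the CNN feature extractor
```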
The usual fine-tuning training behavior looks something like this:
Beginning: Output random chars
Early: Output nothing - empty strings - looks like you are here?
After a while: Starts to spit out more relevant chars
I trained for 50 epochs with a learning rate of 1e-5. The learning rate scheduler was left at the default (the same as in the article). The model does start out predicting random chars and then moves on to empty strings, but even after 50 epochs no relevant chars ever appear. You can see my output in the file '50epochs_output.pdf' in this folder.
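For reference, my training arguments look roughly like this. Apart from the epoch count and learning rate, the values below are only illustrative placeholders, and the output path is made up:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-single-word",  # placeholder path
    num_train_epochs=50,                  # trained the full 50 epochs
    learning_rate=1e-5,
    lr_scheduler_type="linear",           # default scheduler, as in the guide
    per_device_train_batch_size=16,       # placeholder, whatever fits the GPU
    evaluation_strategy="steps",
    save_total_limit=2,
)
```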
I listened to your audio - the quality is really, really bad. It's no surprise the model isn't able to predict on it.
Wav2vec2 is pretrained on 16 kHz audio. What is the sample rate of your data?
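If it isn't 16 kHz, it's worth resampling the clips before feeding them to the model. A minimal sketch using torchaudio (file paths are placeholders):

```python
import torchaudio

# Load a clip and check its native sample rate
waveform, sample_rate = torchaudio.load("clip.wav")
print(sample_rate)  # wav2vec2 expects 16000

# Resample to 16 kHz if needed
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    torchaudio.save("clip_16k.wav", resampler(waveform), 16_000)
```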