Thanks for offering to help me!
For anyone else reading, this is a continuation of the discussion on huggingface/transformers issue 12137. In short, despite following the official wav2vec2 English ASR guide, my model does not converge on my dataset of 1-second audio clips, each containing a single English word.
When I run the guide’s Colab notebook on the TIMIT dataset, however, the model converges just fine, which is perplexing to me.
Here is my training notebook.