Wav2vec2-base task performance

@patrickvonplaten Thank you for the great work on releasing many variants of wav2vec2 and the tutorials. They are super helpful. I am new to the ASR domain and was able to reproduce some of the results with the released models. When comparing the WER to the paper, however, I noticed a gap between HF's models and the paper's scores, and I couldn't figure out where the gap was coming from. I've put the numbers in the table below; could you shed some light? Thank you!

| Model | Pretraining Dataset | Fine-tuning Dataset | Eval. Dataset | WER (%) | Relative (%) |
| --- | --- | --- | --- | --- | --- |
| Wav2vec 2.0 - Table 1, 3rd row from bottom | LS-960 | Labelled LS-100 | test-clean | 2.60 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test-clean | 6.10 | 235% |
| Wav2vec 2.0 - Table 1, 3rd row from bottom | LS-960 | Labelled LS-100 | test-other | 6.3 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test-other | 13.5 | 214% |
| Wav2vec 2.0 - Table 2, 3rd row from bottom | LS-960 | Labelled LS-960 | test-clean | 2.1 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test-clean | 3.4 | 162% |
| Wav2vec 2.0 - Table 2, 3rd row from bottom | LS-960 | Labelled LS-960 | test-other | 4.8 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test-other | 8.6 | 179% |
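
For reference, WER numbers like the HF rows above can be reproduced with a short evaluation script along these lines (a minimal sketch: greedy decoding, no LM, scored with the `evaluate` library; the exact setup behind my numbers may differ):

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h").eval()
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
wer = evaluate.load("wer")

# LibriSpeech test-clean; swap "clean" for "other" to score test-other
ds = load_dataset("librispeech_asr", "clean", split="test")

def transcribe(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # greedy (argmax) CTC decoding, no language model involved
    pred_ids = torch.argmax(logits, dim=-1)
    batch["prediction"] = processor.batch_decode(pred_ids)[0]
    return batch

ds = ds.map(transcribe)
print("WER:", wer.compute(predictions=ds["prediction"], references=ds["text"]))
```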

I think that may be because in the paper they add a transformer LM to improve performance. By the way, I'm also wondering whether there is any discussion of how to combine a transformer LM with Hugging Face's wav2vec2 model. I've found a blog article by @patrickvonplaten showing how to boost wav2vec2 with an n-gram LM, but I don't currently know how to combine the model with a transformer LM; one common approach seems to be n-best rescoring, sketched below.
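
A rough sketch of what I mean by n-best rescoring with a transformer LM (gpt2, the `hypotheses` format, and the weight `alpha` are my own placeholders, not an official transformers API):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def lm_logprob(text: str) -> float:
    """Approximate total log-likelihood of `text` under GPT-2."""
    ids = lm_tok(text.lower(), return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean NLL over the (len - 1) shifted target tokens
        loss = lm(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

def rescore(hypotheses, alpha=0.5):
    """Pick the best (text, acoustic_logp) pair after adding a weighted LM score.

    `hypotheses` would come from a beam-search CTC decoder,
    e.g. pyctcdecode's decode_beams().
    """
    return max(hypotheses, key=lambda h: h[1] + alpha * lm_logprob(h[0]))
```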


It should actually be very easy to add an LM to Wav2Vec2 - I've done it here: patrickvonplaten/wav2vec2-base-100h-with-lm · Hugging Face

All you need to do is take an official n-gram LM, e.g. this one: openslr.org

and then just follow the blog post here: Boosting Wav2Vec2 with n-grams in 🤗 Transformers
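
Concretely, the decoding step from that post looks roughly like this (a minimal sketch using the checkpoint above; `pyctcdecode` and `kenlm` need to be installed, and the dummy dataset is just for illustration):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

ckpt = "patrickvonplaten/wav2vec2-base-100h-with-lm"
model = AutoModelForCTC.from_pretrained(ckpt).eval()
processor = Wav2Vec2ProcessorWithLM.from_pretrained(ckpt)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs a pyctcdecode beam search that folds in the n-gram LM
print(processor.batch_decode(logits.numpy()).text[0])
```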

The results without an LM should match more or less - I've tested this for a couple of checkpoints 🙂

Thanks @patrickvonplaten, @Kuray107 for your comments. It appears I misunderstood and made an incorrect correspondence between the tables. Updated table as follows; HF's models are close to the paper 🙂

| Model | Pretraining Dataset | Fine-tuning Dataset | Eval. Dataset | WER (%) | Relative (%) |
| --- | --- | --- | --- | --- | --- |
| Wav2vec 2.0 - Table 9, 9th row from bottom | LS-960 | Labelled LS-100 | test-clean | 6.1 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test-clean | 6.1 | 0% |
| Wav2vec 2.0 - Table 9, 9th row from bottom | LS-960 | Labelled LS-100 | test-other | 13.3 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test-other | 13.5 | 2% |
| Wav2vec 2.0 - Table 10, 9th row from bottom | LS-960 | Labelled LS-960 | test-clean | 3.4 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test-clean | 3.4 | 0% |
| Wav2vec 2.0 - Table 10, 9th row from bottom | LS-960 | Labelled LS-960 | test-other | 8.5 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test-other | 8.6 | 1% |