@patrickvonplaten Thank you for the great work on releasing the many variants of wav2vec2 and the tutorials - they are super helpful. I am new to the ASR domain and was able to reproduce some of the results with the released models. When I tried to compare the WER against the paper, I noticed a gap between HF's models and the paper's scores, and I couldn't figure out where it was coming from. I've put the numbers in the table below - could you shed some light? Thank you!
| Model | Pretraining Dataset | Fine-Tuning Dataset | Eval. Dataset | WER (%) | Relative (%) |
|---|---|---|---|---|---|
| Wav2vec 2.0 - Table 1, 3rd row from bottom | LS-960 | Labelled LS-100 | clean/test | 2.60 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | clean/test | 6.10 | 235% |
| Wav2vec 2.0 - Table 1, 3rd row from bottom | LS-960 | Labelled LS-100 | other/test | 6.3 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | other/test | 13.5 | 214% |
| Wav2vec 2.0 - Table 2, 3rd row from bottom | LS-960 | Labelled LS-960 | clean/test | 2.1 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | clean/test | 3.4 | 162% |
| Wav2vec 2.0 - Table 2, 3rd row from bottom | LS-960 | Labelled LS-960 | other/test | 4.8 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | other/test | 8.6 | 179% |
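For reference, the HF numbers above are from plain argmax (no-LM) decoding; below is a minimal sketch of the kind of evaluation I mean, assuming `transformers`, `datasets`, and `jiwer` are installed (batching and text normalization simplified):

```python
# Minimal sketch: greedy (no-LM) WER of facebook/wav2vec2-base-100h on
# LibriSpeech test-clean. Assumes: pip install transformers datasets jiwer
import torch
from datasets import load_dataset
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h").eval()

dataset = load_dataset("librispeech_asr", "clean", split="test")

predictions, references = [], []
for sample in dataset:
    inputs = processor(
        sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: argmax over the vocabulary, no language model.
    ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(ids)[0])
    references.append(sample["text"])

print(f"WER: {wer(references, predictions):.2%}")
```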
I think that may be because in the paper they add a Transformer LM to improve performance. By the way, is there any discussion about how to combine a Transformer LM with Hugging Face's wav2vec2 model? I've found a blog article written by @patrickvonplaten showing how to boost wav2vec2 with an n-gram LM, but I currently don't know how to combine the model with a Transformer LM.
It should actually be very easy to add an LM to Wav2Vec2 - I've done it here: patrickvonplaten/wav2vec2-base-100h-with-lm · Hugging Face
All you need to do is take an official n-gram, e.g. this one: openslr.org
and then just follow the blog post here: Boosting Wav2Vec2 with n-grams in 🤗 Transformers
The results without an LM should match more or less - I've tested this for a couple of checkpoints. See the decoding sketch below.
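To illustrate, here is a minimal decoding sketch with such a checkpoint, assuming `pyctcdecode` and `kenlm` are installed and that the repo ships both the acoustic model and the n-gram LM:

```python
# Minimal sketch: CTC beam-search decoding with an n-gram LM via
# Wav2Vec2ProcessorWithLM. Assumes: pip install pyctcdecode kenlm
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

repo = "patrickvonplaten/wav2vec2-base-100h-with-lm"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(repo)
model = AutoModelForCTC.from_pretrained(repo).eval()

# One LibriSpeech sample for illustration.
sample = next(iter(load_dataset(
    "librispeech_asr", "clean", split="validation", streaming=True
)))["audio"]

inputs = processor(sample["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# batch_decode runs beam search against the n-gram LM (via pyctcdecode)
# instead of plain argmax decoding.
print(processor.batch_decode(logits.numpy()).text[0])
```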
Thanks @patrickvonplaten, @Kuray107 for your comments. It appears I misunderstood and made an incorrect correspondence between the paper's tables and the checkpoints. Updated table is as follows; HF's models are close to the paper.
| Model | Pretraining Dataset | Fine-Tuning Dataset | Eval. Dataset | WER (%) | Relative (%) |
|---|---|---|---|---|---|
| Wav2vec 2.0 - Table 9, 9th row from bottom | LS-960 | Labelled LS-100 | test/clean | 6.1 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test/clean | 6.1 | 0% |
| Wav2vec 2.0 - Table 9, 9th row from bottom | LS-960 | Labelled LS-100 | test/other | 13.3 | baseline |
| facebook/wav2vec2-base-100h | LS-960 | Labelled LS-100 | test/other | 13.5 | 2% |
| Wav2vec 2.0 - Table 10, 9th row from bottom | LS-960 | Labelled LS-960 | test/clean | 3.4 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test/clean | 3.4 | 0% |
| Wav2vec 2.0 - Table 10, 9th row from bottom | LS-960 | Labelled LS-960 | test/other | 8.5 | baseline |
| facebook/wav2vec2-base-960h | LS-960 | Labelled LS-960 | test/other | 8.6 | 1% |