Does HuBERT need text as well as audio for fine-tuning? / How to achieve sub-5% WER?

There’s a fine-tuning guide provided here that was originally written for wav2vec2: facebook/hubert-xlarge-ll60k · Hugging Face

However, I’m interested in achieving wav2vec2’s published performance (around 3% WER, not 18%). Because this wav2vec2 implementation does not use a language model, it is stuck at around 18%.

However, if I understand correctly, HuBERT doesn’t need text? HuBERT: Speech representations for recognition & generation

But the current fine-tuning notebook uses a dataset with text.

Nevertheless, let’s say it does need text. If it is fine-tuned, will it achieve the same or similar performance as in the paper above (around 3% WER), or will it also need its own language model like wav2vec2 and remain at around 18%?

Which parts did you change from the wav2vec2 example to get HuBERT to work?


Note that we now have an official fine-tuning example that also works for HuBERT:

Also see examples below:
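On the text question above: HuBERT’s pre-training is indeed text-free, but fine-tuning for ASR attaches a CTC head, and the CTC loss is computed against the transcript’s token ids, so the fine-tuning dataset needs text just like wav2vec2’s does. A minimal PyTorch sketch (shapes and token ids are invented for illustration):

```python
import torch
import torch.nn as nn

# The CTC loss compares the model's per-frame log-probabilities against a
# target sequence of token ids derived from the transcript. Without text,
# there is nothing to compute this loss against.
T, N, C = 50, 1, 32                          # time steps, batch size, vocab size
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # stand-in for model output
targets = torch.tensor([[5, 12, 7, 3, 19]])  # transcript as token ids
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([5])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # negative log-likelihood of the transcript
```

This is also why the ~3% vs. ~18% WER gap is orthogonal to HuBERT vs. wav2vec2: the published low WERs come from decoding with a language model on top of the CTC output, which neither model gets for free.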

Hey @patrickvonplaten, I’m trying to fine-tune the pretrained HuBERT model on a custom (multilingual) dataset. I’m using a tokenizer that has the required tokens (already tested with Wav2Vec2). Do I need to change the feature extractor, or is Wav2Vec2FeatureExtractor the one to use even with HuBERT?

Hey @spranjal25,

For multilingual fine-tuning I strongly recommend using the XLS-R models; they should perform much better :slight_smile: Think this blog post should help:
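On the feature extractor part of the question: in transformers, HuBERT consumes raw 16 kHz waveforms the same way wav2vec2 does, so Wav2Vec2FeatureExtractor is the one to use; no HuBERT-specific extractor exists. A short sketch (the parameter values mirror the usual wav2vec2 setup, and the audio is fake):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# HuBERT reuses Wav2Vec2FeatureExtractor: it only normalizes and pads the
# raw waveform, it does not compute spectrogram features.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
speech = np.random.randn(16000).astype(np.float32)  # 1 second of fake audio
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="np")
print(inputs.input_values.shape)  # (1, 16000): raw samples, just normalized
```

The same extractor can then be combined with your multilingual tokenizer in a Wav2Vec2Processor, exactly as in the wav2vec2 example.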