Does HuBERT need text as well as audio for fine-tuning? / How to achieve sub-5% WER?

youssefav · June 16, 2021, 10:37pm

There’s a fine-tuning guide provided here that was for wav2vec2: facebook/hubert-xlarge-ll60k · Hugging Face

However, I’m interested in reaching the performance actually reported for wav2vec2 (about 3% WER, not 18%). Because this wav2vec2 implementation does not use a language model, it is stuck at around 18%.

However, if I understand correctly, HuBERT doesn’t need text? HuBERT: Speech representations for recognition & generation

But the current fine-tuning notebook is using a dataset with text.

Nevertheless, let’s say it does need text. If it is fine-tuned, will it reach performance similar to the roughly 3% reported in the paper above, or will it also need its own language model, like wav2vec2, and remain at around 18%?
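For reference, the WER figures quoted in this post are the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch of that calculation, assuming plain whitespace tokenization and no text normalization; the helper name `wer` is hypothetical and not taken from any library.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing in output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Published numbers are usually computed over a whole test set after text normalization, so a per-utterance value from this sketch is only a rough guide.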

tadf · October 8, 2021, 8:18am

Which parts did you change from the Wav2Vec2 example to get HuBERT to work?

patrickvonplaten · October 11, 2021, 10:50pm

Hey,

Note that we now have an official fine-tuning example that also works for HuBERT:

Also see examples below:

spranjal25 · February 17, 2022, 10:48am

Hey @patrickvonplaten, I’m trying to fine-tune the HuBERT pretrained model on a custom (multilingual) dataset. I’m using a tokenizer that has the required tokens (already tested with Wav2Vec2). Do I need to change the feature extractor, or is Wav2Vec2FeatureExtractor the one to use even with HuBERT?

patrickvonplaten · March 18, 2022, 4:15pm

Hey @spranjal25,

For multi-lingual fine-tuning I strongly recommend using the XLS-R models; they should perform much better. I think this blog post should help:
