There’s a fine-tuning guide provided here, originally written for wav2vec2: facebook/hubert-xlarge-ll60k · Hugging Face
However, I’m interested in reaching the actual reported performance of wav2vec2 (around 3% WER, not 18%). Because this wav2vec2 implementation does not use a language model, it only reaches about 18%.
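As an aside, the 3% vs. 18% figures are word error rate: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of how it is computed (my own illustration, not code from the notebook):

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over word tokens,
# normalized by the reference length. This is the metric behind the 3% vs. 18%
# numbers quoted above.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words ≈ 0.1667
```

A language model applied during CTC decoding reduces this number by rescoring hypotheses toward likely word sequences, which is why the LM-free setup sits so much higher.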
With HuBERT, though, if I understand correctly, pre-training doesn’t need text? HuBERT: Speech representations for recognition & generation
But the current fine-tuning notebook uses a dataset with text transcriptions.
Nevertheless, let’s say it does need text for fine-tuning. If HuBERT is fine-tuned, will it achieve the roughly 3% WER reported in the paper above, or will it also need its own language model like wav2vec2 and remain at around 18%?