Does HuBERT need text as well as audio for fine-tuning? / How to achieve sub-5% WER?

There’s a fine-tuning guide provided on the model card here, but it was written for wav2vec2: facebook/hubert-xlarge-ll60k · Hugging Face

However, I’m interested in matching the actual published performance of wav2vec2 (around 3% WER, not 18%). Because this wav2vec2 implementation decodes without a language model, it is stuck at around 18%.
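For concreteness, here is roughly what I mean by decoding with a language model. This is only a sketch using pyctcdecode, assuming a separately trained KenLM n-gram file; the checkpoint name and the 4gram.arpa path are placeholders, and real use likely needs some alphabet adjustments:

```python
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Checkpoint is just an example; "4gram.arpa" is a placeholder for an
# n-gram LM you would have to train yourself (e.g. with KenLM).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# pyctcdecode wants the vocabulary ordered by token id.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="4gram.arpa")

audio = torch.randn(16000)  # stand-in for 1 s of real 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Beam search rescored by the n-gram LM, instead of plain greedy argmax.
print(decoder.decode(logits[0].numpy()))
```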

With HuBERT, though, if I understand correctly, it doesn’t need text? HuBERT: Speech representations for recognition & generation

But the current fine-tuning notebook uses a dataset with text transcriptions.
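As far as I can tell, that’s inherent to the CTC objective: the fine-tuning loss is computed against character labels derived from the transcription. Here is a minimal sketch of what I think the notebook does, with real checkpoint names but dummy audio and text:

```python
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

# Borrowing the fine-tuned checkpoint's processor just for its character
# vocabulary; in a real run you'd build your own tokenizer as in the guide.
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained(
    "facebook/hubert-large-ll60k",  # pretrained-only checkpoint, audio only
    vocab_size=len(processor.tokenizer),
    pad_token_id=processor.tokenizer.pad_token_id,
)

audio = torch.randn(16000)  # stand-in for 1 s of real 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# This is where the text comes in: the CTC head is trained against
# character labels derived from the transcription.
labels = processor.tokenizer("A DUMMY TRANSCRIPTION", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss
loss.backward()  # one fine-tuning step's gradient
```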

Nevertheless, let’s say it does need text. If it is fine-tuned, will it achieve the same (or similar) performance as in the paper above, around 3% WER, or will it, like wav2vec2, also need its own language model and otherwise remain at around 18%?