Does HuBERT need text as well as audio for fine-tuning? / How to achieve sub-5% WER?

There’s a fine-tuning guide linked on the model card, but it was written for wav2vec2: facebook/hubert-xlarge-ll60k · Hugging Face

However, I’m interested in matching the WER reported for wav2vec2 in its paper (around 3%, not 18%). Because this wav2vec2 implementation does not use a language model, it only reaches about 18% WER.

However, if I understand correctly, HuBERT doesn’t need text? HuBERT: Speech representations for recognition & generation

But the current fine-tuning notebook uses a dataset with text.

Nevertheless, let’s say it does need text. If it is fine-tuned, will it reach the roughly 3% WER reported in the paper above, or will it also need its own language model, like wav2vec2, and remain at around 18%?
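For what it’s worth, the CTC fine-tuning objective itself does require transcripts: the loss is computed against tokenized text labels. A minimal sketch with a tiny randomly initialized `HubertForCTC` (the config sizes below are illustrative, chosen only to keep the example fast; in practice you would load a pretrained checkpoint such as `facebook/hubert-large-ll60k`):

```python
import torch
from transformers import HubertConfig, HubertForCTC

# Tiny randomly initialised model just to show the API shape;
# all sizes here are illustrative, not a recommended config.
config = HubertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    vocab_size=32,          # size of the character vocab built from the transcripts
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
model = HubertForCTC(config)
model.eval()

waveform = torch.randn(1, 16000)          # 1 second of 16 kHz audio
labels = torch.tensor([[4, 8, 15, 16]])   # tokenised transcript -> CTC targets

with torch.no_grad():
    out = model(input_values=waveform, labels=labels)
print(float(out.loss))
```

Without the `labels` (i.e. without text), there is no CTC loss to backpropagate, so supervised fine-tuning for ASR still needs paired audio/text.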

Which parts did you change from the wav2vec2 example to get HuBERT to work?


Note that we now have an official fine-tuning example that also works for HuBERT:

Also see examples below:
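As a rough sketch of what the swap looks like: the processor, data collator, and Trainer code from the wav2vec2 notebook carry over, and essentially only the model class (and checkpoint name) changes. The checkpoint names in the comments below are illustrative, not taken from this thread:

```python
import inspect
from transformers import HubertForCTC, Wav2Vec2ForCTC

# Before: model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-lv60", ...)
# After:  model = HubertForCTC.from_pretrained("facebook/hubert-large-ll60k", ...)

# Both CTC heads accept the same core forward arguments (input_values, labels, ...),
# which is why the rest of the fine-tuning script carries over unchanged.
w2v_params = set(inspect.signature(Wav2Vec2ForCTC.forward).parameters)
hub_params = set(inspect.signature(HubertForCTC.forward).parameters)
print(sorted(hub_params & w2v_params))
```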