Hey everyone. I want to use wav2vec2 to perform ASR using data in my language (Greek). As such, I took a look at the various wav2vec2 pretrained models that exist in the model hub, and there are two things I don’t understand:
Some versions, like facebook/wav2vec2-large-lv60, say in the description that the model ‘should be fine-tuned on a downstream task, like Automatic Speech Recognition’. On the other hand, other versions, like ‘facebook/wav2vec2-large-960h-lv60’ (sorry, can’t post more than 2 links), state no such requirement and also provide code snippets showing how to use the model.
Furthermore, the first group of models does not have code examples, but their pages link to this amazing blog post, Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers, which I studied.
Forgive me if my question is getting too long, but I’d like to ask something more specific about this blog post. I noticed two things: (1) the author does not load the tokenizer and feature extractor using the ‘from_pretrained()’ method, but instantiates them directly, and (2) the code examples for the second group of models mentioned earlier (those that include examples on their page) do not seem to use a feature extractor at all. What are the reasons behind these differences?
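To make point (1) concrete, here is a minimal sketch of what I understand "instantiating them directly" to mean, as opposed to calling ‘from_pretrained()’. The tiny Greek vocabulary and the temporary path are made up purely for illustration, and the feature extractor settings are what I believe the blog post uses, so treat them as assumptions rather than the exact setup.

```python
import json
import os
import tempfile

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
)

# Hypothetical toy vocabulary; a real one would be built from the
# characters found in the training transcripts.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "α": 3, "β": 4, "γ": 5}
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.json")
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# Tokenizer built from a local vocab file instead of from_pretrained().
tokenizer = Wav2Vec2CTCTokenizer(
    vocab_path,
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)

# Feature extractor built from explicit settings (assumed values).
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# The processor wraps both objects behind a single interface.
processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
)

# By contrast, for a checkpoint that ships these files I would expect
# something like the following to work (not run here, needs network):
# processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60")
```

My guess is that the direct route is needed when a checkpoint has no vocabulary for the target language, but I would like confirmation.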
Sorry again for the lengthy question. I’d really appreciate any help. Thanks in advance!