Hi, I would like to recognize multilingual speech and translate it to English for further processing, and I have a few questions about it.
- What is the difference between using the baseline model vs. a model fine-tuned for a specific language, i.e. facebook/wav2vec2-large-xlsr-53 vs. jonatasgrosman/wav2vec2-large-xlsr-53-english (both on Hugging Face)? Which one should I use, and aren't they the same model?
- Can I use XLSR-53 on its own to recognize multilingual speech in an end-to-end manner? Or should I fine-tune the model on each of the 53 languages and store 53 versions of it (one per language)? Do I have to prepend language-related tokens to the model inputs so that it recognizes speech in that language?
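For context, this is roughly how I am planning to run transcription. It is only a sketch assuming the `transformers` `pipeline` API and the English fine-tuned checkpoint linked above; the `transcribe` helper is my own wrapper, not part of any library.

```python
# Sketch: transcribe an audio file with the fine-tuned XLSR-53 checkpoint.
# Assumes the transformers library is installed and the model can be downloaded.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    # Lazy import so the helper can be defined even without transformers installed.
    from transformers import pipeline

    # The automatic-speech-recognition pipeline wraps the model and its
    # feature extractor / tokenizer; it accepts a path to an audio file.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]
```

Usage would be something like `text = transcribe("sample.wav")`, which is where my questions about language tokens and per-language checkpoints come in.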