Hi, I would like to recognize multilingual speech and translate it to English for further processing, and I have a few questions about it.
- What is the difference between using the baseline model vs. a model fine-tuned for a specific language, i.e. facebook/wav2vec2-large-xlsr-53 vs. jonatasgrosman/wav2vec2-large-xlsr-53-english (both on Hugging Face)? Which one should I use, and aren't they the same model?
- Can I use XLSR-53 on its own to recognize multilingual speech in an end-to-end manner? Or should I fine-tune the model on each of the 53 languages and store 53 versions of it (one per language)? Do I have to prepend language-related tokens to the model inputs so that it recognizes speech in that language?
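For context, this is roughly how I am planning to run transcription. It is only a sketch assuming the `transformers` `pipeline` API and the English fine-tuned checkpoint linked above; the `transcribe` helper is my own wrapper, not part of any library.

```python
# Sketch: transcribe an audio file with the fine-tuned XLSR-53 checkpoint.
# Assumes the transformers library is installed and the model can be downloaded.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    # Lazy import so the helper can be defined even without transformers installed.
    from transformers import pipeline

    # The automatic-speech-recognition pipeline wraps the model and its
    # feature extractor / tokenizer; it accepts a path to an audio file.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]
```

Usage would be something like `text = transcribe("sample.wav")`, which is where my questions about language tokens and per-language checkpoints come in.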