Baseline model vs fine-tuned model for multilingual speech recognition

Hi, I would like to recognize multilingual speech and translate it to English for further processing, and I have a few questions about this.

  1. What is the difference between using a baseline model vs a model fine-tuned for a specific language?
    For example, facebook/wav2vec2-large-xlsr-53 vs jonatasgrosman/wav2vec2-large-xlsr-53-english (both on Hugging Face).
    Which of the two should I use, and what is the difference between them? Aren’t they the same model?

  2. Can XLSR-53 recognize multilingual speech by itself? Or should I fine-tune the model on each of the 53 languages and store 53 versions of it (one per language), switching between them as needed? Do I have to prepend language-specific tokens to the model inputs so that it recognizes speech in that language?
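For context, this is roughly how I imagine using a per-language fine-tuned checkpoint at inference time. It is just a sketch based on the standard `transformers` CTC API; the checkpoint dictionary is my assumption of what the "53 versions" setup would look like (only the English entry is one of the models I linked above), and I haven't verified this is the right approach:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical per-language registry for the "53 versions" scenario.
# Only the English checkpoint is from my links above; the rest would
# be filled in with whatever fine-tuned checkpoints exist.
CHECKPOINTS = {
    "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english",
}


def transcribe(speech, lang):
    """Transcribe a 1-D float array of 16 kHz audio, given its language.

    `speech` is assumed to already be mono and resampled to 16 kHz,
    which is the sampling rate these wav2vec2 checkpoints expect.
    """
    model_id = CHECKPOINTS[lang]
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: take the most likely token at each frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

If XLSR-53 can handle all languages on its own, I would obviously prefer to skip the per-language dictionary entirely.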