Baseline model vs fine-tuned model for multilingual speech recognition

Hi, I would like to recognize multilingual speech and translate it to English for further processing, and I have a few questions about this.

  1. What is the difference between using a baseline model vs a model fine-tuned for a specific language?
    For example, facebook/wav2vec2-large-xlsr-53 vs jonatasgrosman/wav2vec2-large-xlsr-53-english (both on Hugging Face).
    Which of the two should I use, and what is the difference between them? Aren’t they the same model?

  2. Can XLSR-53 recognize multilingual speech by itself? Or should I fine-tune the model on each of the 53 languages and store 53 versions of it (one per language), switching between them as needed? Do I have to prepend language-specific tokens to the model inputs so that it recognizes speech in that language?
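For context, this is roughly how I imagine using a per-language fine-tuned checkpoint at inference time. It is just a sketch based on the standard `transformers` CTC API; the checkpoint dictionary is my assumption of what the "53 versions" setup would look like (only the English entry is one of the models I linked above), and I haven't verified this is the right approach:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical per-language registry for the "53 versions" scenario.
# Only the English checkpoint is from my links above; the rest would
# be filled in with whatever fine-tuned checkpoints exist.
CHECKPOINTS = {
    "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english",
}


def transcribe(speech, lang):
    """Transcribe a 1-D float array of 16 kHz audio, given its language.

    `speech` is assumed to already be mono and resampled to 16 kHz,
    which is the sampling rate these wav2vec2 checkpoints expect.
    """
    model_id = CHECKPOINTS[lang]
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: take the most likely token at each frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

If XLSR-53 can handle all languages on its own, I would obviously prefer to skip the per-language dictionary entirely.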