Wanted to check on this model released by Facebook. It's a cross-lingual model; does that mean it can understand audio that contains mixed languages?
Can I fine-tune this model on a dataset with 2 languages?
This is so that I don't need to detect the audio's source language before running ASR.


I haven't personally tried it, but fine-tuning on 2 languages simultaneously would probably work. However, since the combined token vocabulary would be larger, accuracy would likely be lower for the same amount of data.

Another way you could approach it is by running 3 models in parallel, with language detection as a separate pipeline. I'm assuming you are doing this in a business context, so you could have 3 models running at once: one for language detection and one for each of the two languages. Based on the output of the language-detection model, you pick the respective ASR output.
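The routing idea above can be sketched roughly like this. Everything here is illustrative: the function names are hypothetical stubs, and in practice each one would wrap a real model (an audio language classifier and two fine-tuned ASR checkpoints).

```python
# Sketch of the parallel-pipeline routing idea. All functions are stubs
# standing in for real models; only the dispatch logic is the point.

def detect_language(audio):
    # Placeholder LID: a real system would run an audio classifier here.
    return "en" if audio.get("hint") == "en" else "fa"

def asr_en(audio):
    # Placeholder for an English-only ASR model.
    return "english transcript"

def asr_fa(audio):
    # Placeholder for a Persian-only ASR model.
    return "persian transcript"

def transcribe(audio):
    # Run both ASR models (concurrently in production) and keep the
    # output that matches the detected language.
    outputs = {"en": asr_en(audio), "fa": asr_fa(audio)}
    return outputs[detect_language(audio)]

print(transcribe({"hint": "en"}))  # english transcript
```

Running the two ASR models unconditionally costs extra compute but avoids waiting on the language-detection step before starting transcription.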

Yeah, I see a lot of people have fine-tuned on a single language, so I was wondering why no one has done it for 2 languages in one model.

Can this XLS-R do language detection? Or is there any model that does audio language detection?


Hi dear @becks,
Have you done anything to solve your problem?
I have the same problem: I want a model that handles both Persian and English characters. As you know, this is possible in Kaldi-based models, but with Wav2Vec2 I have no idea!
I want to ask @patrickvonplaten: is it possible to fine-tune a Wav2Vec2 model on two different languages simultaneously?


Sorry to reply so late here!

It's indeed very much possible to fine-tune the model on multiple languages simultaneously. @anton-l has done so for the XLS-R model; I think it can be found here: anton-l/xtreme_s_xlsr_300m_mls · Hugging Face

In short, you just need to make sure that the character vocab contains all the characters of both your languages, and there shouldn't be a problem!
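Building that joint vocab can be sketched like this. The transcripts and the `extract_vocab` helper below are made-up examples; the resulting mapping is the kind of `vocab.json` a `Wav2Vec2CTCTokenizer` is built from.

```python
# Sketch: build one character vocabulary covering two languages, so a
# single CTC head can emit characters from either alphabet.

def extract_vocab(transcripts):
    """Collect the set of unique characters across all transcripts."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    return chars

# Toy transcripts standing in for the two language datasets.
english_texts = ["hello world", "speech recognition"]
persian_texts = ["سلام دنیا"]

# Union of both alphabets, sorted for a stable index assignment.
joint_chars = sorted(extract_vocab(english_texts) | extract_vocab(persian_texts))
vocab = {c: i for i, c in enumerate(joint_chars)}

# CTC special tokens appended at the end, as in the usual fine-tuning setup.
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
```

Saving `vocab` as JSON and pointing the tokenizer at it is then the same workflow as single-language fine-tuning, just with a larger character set.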

The other possibility would be to use a phoneme-based Wav2Vec2 model instead: Wav2Vec2Phoneme