Wav2vec2-large-xlsr-53

becks · April 12, 2021, 2:44am

Hi,
I wanted to check on this model released by Facebook. It is a cross-lingual model; does that mean it can understand audio that contains mixed languages?
Can I fine-tune this model on a dataset with 2 languages?
That way I would not need to detect the language of my audio source before running ASR.

Thanks
Becks

infinitejoy · April 12, 2021, 5:14am

I have not personally tried it, but fine-tuning on 2 languages simultaneously would probably work. However, since the combined vocabulary would contain more tokens, accuracy would likely be lower for the same amount of data.

Another way to approach it is to run 3 models in parallel, with language detection as a separate pipeline. I am assuming that you are doing this in a business context, so you could have one model for language detection and one ASR model for each of the two languages. Based on the output of the language detection model, you pick the corresponding ASR output.
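The routing idea above could be sketched roughly as follows. This is only a minimal illustration, not a tested production setup: `transcribe_any_language`, `detect_language`, and `asr_models` are hypothetical names, and each "model" is assumed to be any callable that takes raw audio and returns a string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type aliases: each model is any callable that takes raw audio.
LanguageDetector = Callable[[bytes], str]  # returns a language code, e.g. "en"
Transcriber = Callable[[bytes], str]       # returns a transcript


def transcribe_any_language(
    audio: bytes,
    detect_language: LanguageDetector,
    asr_models: Dict[str, Transcriber],
) -> str:
    """Run language detection and every ASR model in parallel, then keep
    the transcript from the model that matches the detected language."""
    with ThreadPoolExecutor(max_workers=len(asr_models) + 1) as pool:
        lang_future = pool.submit(detect_language, audio)
        asr_futures = {
            lang: pool.submit(model, audio) for lang, model in asr_models.items()
        }
        detected = lang_future.result()
        if detected not in asr_futures:
            raise ValueError(f"No ASR model for detected language: {detected!r}")
        return asr_futures[detected].result()
```

Running everything in parallel trades extra compute for lower latency; if compute matters more, you could instead wait for the language detector and only then call the one matching ASR model.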

becks · April 12, 2021, 7:24am

Yeah, I see a lot of people have fine-tuned on a single language. So I was wondering why no one has done it for 2 languages in one model?

Can this XLSR model do language detection? Or is there any other model that does audio language detection?

SaraSadeghi · June 25, 2022, 12:54pm

Hi @becks,
Have you done anything to solve your problem?
I have the same problem: I want a model that handles both Persian and English characters. As you know, this is more or less possible with Kaldi-based models, but with Wav2vec2 I have no idea!
I want to ask @patrickvonplaten: is it possible to fine-tune Wav2vec2 on two different languages simultaneously?

patrickvonplaten · July 26, 2022, 3:32pm

Hey,

Sorry to reply so late here!

It’s indeed very much possible to fine-tune the model on multiple languages simultaneously. @anton-l has done so for the XLS-R model, which I think can be found in this repository on Hugging Face: anton-l/xtreme_s_xlsr_300m_mls

In short, you just need to make sure that the character vocab contains all the characters of your two languages, and there shouldn’t be a problem!
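As a rough sketch of what a combined character vocab could look like (not necessarily how the model linked above was built): the helper `build_char_vocab` and the sample transcripts are made up for illustration, and the `|`, `[UNK]`, and `[PAD]` entries are an assumption based on the convention commonly used for Wav2Vec2 CTC fine-tuning.

```python
import json
from typing import Dict, Iterable


def build_char_vocab(*transcript_sets: Iterable[str]) -> Dict[str, int]:
    """Collect every character seen in any language's transcripts and
    map each one to a unique integer id."""
    chars = set()
    for transcripts in transcript_sets:
        for text in transcripts:
            chars.update(text.lower())
    # The space is represented by an explicit word delimiter instead.
    chars.discard(" ")
    chars.discard("|")
    vocab = {ch: idx for idx, ch in enumerate(sorted(chars))}
    # Assumed special tokens: word delimiter, unknown, and padding/CTC blank.
    for special in ("|", "[UNK]", "[PAD]"):
        vocab[special] = len(vocab)
    return vocab


# Hypothetical sample transcripts, one list per language.
english = ["hello world"]
persian = ["سلام دنیا"]

vocab = build_char_vocab(english, persian)
vocab_json = json.dumps(vocab, ensure_ascii=False)  # contents for a vocab.json file
```

The resulting JSON could then be saved as a `vocab.json` and loaded with a CTC tokenizer such as `Wav2Vec2CTCTokenizer` in 🤗 Transformers; the model's output layer size would need to match `len(vocab)`.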

The other possibility would be to use a phoneme-based Wav2Vec2 model instead: Wav2Vec2Phoneme
