[Open-to-the-community] Robust Speech Recognition Challenge

Robust speech recognition in 70+ Languages :studio_microphone::earth_africa:

Hi all,

We are scaling multilingual speech recognition systems - come join us for the robust speech community event from Jan 24th to Feb 7th. With compute provided by OVHcloud, we are going from 50 to 70+ languages, from 300M- to 2B-parameter models, and from toy evaluation datasets to real-world audio evaluation.

What it is about :question:

The goal of the event is to provide robust speech recognition systems in as many languages as possible to the community. We hope that especially low-resource languages will profit from this event.

The main components of the speech recognition event consist of:

How does it work :gear:

Participants have two weeks to build as many robust speech recognition systems in as many languages as they want. In general, speech recognition systems can consist of:

  • Fine-tuned speech recognition checkpoints (e.g. XLS-R)
  • Language model boosted decoders (e.g. pyctcdecode + n-gram)
  • Pre- and post-processing modules, such as noise-canceling, spelling correction, …

During the event, you will have the opportunity to work on each of these components to build speech recognition systems in your favorite language!

Each speech recognition system will automatically be evaluated on real-world audio (if available for the language). After the fine-tuning week, the best-performing systems of each language will receive :hugs: SWAG.

What do I need to do to participate :clipboard:

To participate, simply fill out this short Google form. You will also need to create a Hugging Face Hub account here and join our Discord here - when joining the event’s Discord channel, please make sure to click on the :hugs: emoji under the first message to access all relevant information. OVHcloud kindly offered to provide a limited number of GPUs for participants if needed - if you would like to have access to a GPU, please join the Discord for more information*. Here are some detailed videos on how to get started with setting up an OVHcloud account.

This fine-tuning week should be especially interesting to native speakers of low-resource languages. Your language skills will help you select the best training data, and possibly build the best existing speech recognition system in your language.

More detailed information will be announced in the Discord channel. We are looking forward to seeing you there!

What do I get :gift:

  • enjoy a bit of Hugging Face vibe
  • learn how to build state-of-the-art speech recognition systems
  • free compute to build a powerful fine-tuned model under your name on the Hub
  • Hugging Face SWAG if you manage to have the best performing model in a language
  • 100 GPU hours from OVHcloud if you manage to have the best performing model in a language

Open-sourcely yours,

Anton, Omar, Nico & Patrick


Hey, is there any blog/resource available where I can learn how to build an audio dataset for my own native language, so that I can build a speech recognition system during the event for the language I want to work on?

Also very excited for the event :blush:


Hey @Modfiededition,

That’s a great question!
As a start, I think it always makes sense to see what datasets are already publicly available that you could use for your language. You could, e.g. see the Hugging Face Hub here and select the speech-processing tag: Hugging Face – The AI community building the future. and then also your favorite language tag → then you can see which datasets are available through the Hub.

Apart from this you can also check out this github page that lists a lot of publicly available speech datasets: GitHub - jim-schwoebel/voice_datasets: 🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

Extracting audio/transcripts yourself is much more difficult, and you also need to be careful about licensing. YouTube could be a good source, if the licenses allow it.


Great initiative! I am in!

In “Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers”, the script that corrects the “kenlm.arpa” file did not work for me: the spacing it uses to detect the "0 < s > " line was incorrect.
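
For anyone hitting the same problem, here is a minimal sketch of a whitespace-tolerant version of that correction step. It is not the script from the blog post. It assumes the usual ARPA layout (a `ngram 1=N` header line and a unigram line containing the `<s>` token) and that the file has no `</s>` unigram yet; the function name `add_eos_to_arpa` is made up for illustration.

```python
import re


def add_eos_to_arpa(arpa_text: str) -> str:
    """Add a </s> unigram right after the first line that holds the <s> token.

    Assumption: the input has no </s> unigram yet, so the unigram count in
    the header is bumped by one.
    """
    out = []
    added_eos = False
    for line in arpa_text.splitlines(keepends=True):
        header = re.fullmatch(r"(ngram 1=)(\d+)(\s*)", line)
        if header:
            # One more unigram because </s> is added further down.
            out.append(f"{header.group(1)}{int(header.group(2)) + 1}{header.group(3)}")
        elif not added_eos and "<s>" in line.split():
            # Compare whole whitespace-separated tokens, so it does not
            # matter whether the columns are separated by tabs or spaces.
            out.append(line)
            out.append(line.replace("<s>", "</s>"))
            added_eos = True
        else:
            out.append(line)
    return "".join(out)
```

To use it on a file, read the original text, pass it through `add_eos_to_arpa`, and write the result to a new file, so the original stays available for comparison.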


Could you maybe open an issue on transformers or the blog with all the code to reproduce your error? More than happy to help you there then!

Example script to edit kenlm arpa file does not work correctly in kaggle notebook · Issue #15128 · huggingface/transformers · GitHub Opened issue here @patrickvonplaten

Thank you for the great initiative! I’d love to participate and push Mongolian open-source STT even further!

Each speech recognition system will automatically be evaluated on real-world audio (if available for the language).

Is there a list of the languages that “will” be evaluated on real-world audio?

Also, are there any restrictions on data sources?

@patrickvonplaten super hyped to join! A question about evaluation metrics: some languages, like Thai, need word tokenization, and there are usually many competing tokenization standards. Would it be better to use a character-based metric in this case (CER instead of WER)? Example: airesearch/wav2vec2-large-xlsr-53-th · Hugging Face

Great thanks!

Hey @bayartsogt,

That’s a great question! We are trying to get real-world audio for as many languages as possible. Currently we have real-world audio for ca. 30 languages. We’ll try to find something good for Mongolian as well :slight_smile:

Please do not include the Common Voice “test” data split of your preferred language in the training data. Apart from that, there are no restrictions.
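
If you combine several data sources, one way to keep the test clips out of training is to filter by clip identifier. The snippet below is only a sketch on plain Python dictionaries; the field name `path` and the helper `drop_test_overlap` are assumptions for illustration, so adapt them to however your data is actually stored.

```python
def drop_test_overlap(train_rows, test_rows, key="path"):
    """Return the training rows whose identifier does not occur in the test rows."""
    test_keys = {row[key] for row in test_rows}
    return [row for row in train_rows if row[key] not in test_keys]


# Toy example with made-up file names.
train = [
    {"path": "clip_001.mp3", "sentence": "hello"},
    {"path": "clip_002.mp3", "sentence": "world"},
]
test = [{"path": "clip_002.mp3", "sentence": "world"}]
clean_train = drop_test_overlap(train, test)
```

Here `clean_train` keeps only `clip_001.mp3`, since `clip_002.mp3` also appears in the test rows.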

Hey @cstorm125 ,

Very much agree! For certain tokenized languages we will evaluate on CER instead of WER. The CER metric is already available in datasets: Hugging Face – The AI community building the future.
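
To make the difference between the two metrics concrete, here is a small self-contained sketch of WER and CER based on Levenshtein distance. It is not the implementation from `datasets`, and it does no text normalization; the example strings are placeholders standing in for a script written without spaces.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over individual characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Without spaces, the whole utterance counts as a single "word".
unsegmented_wer = wer("abcdef", "abcdxf")  # one wrong character makes the only word wrong
unsegmented_cer = cer("abcdef", "abcdxf")  # one wrong character out of six
```

With no word boundaries, a single wrong character pushes WER to 1.0 while CER is 1/6, which is why CER is the fairer choice when there is no agreed tokenization.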

Please don’t forget to join discord under this link: Hugging Face :slight_smile:

Is this going to be an individual event, or is there an option to team up like in the Flax event?

We’ll evaluate models individually and also hand out GPU compute for individuals. However, we do encourage participants to build teams on their own if they want to and think it helps improve their models :slight_smile:


My language (Bashkir) is in Common Voice 7. Can I use this data to train a model?

Thank you so much @patrickvonplaten for answering my questions.
Hope we can finish strong! :slightly_smiling_face:

I am quite interested in learning about state-of-the-art speech recognition systems.
I am in!

Hi! Great concept, really looking forward to learning a lot about robust speech systems!

I have one question: In the post above you state that the event consists of using “Common Voice newest datasets 7 & 8” – Will Common Voice 8.0 be revealed/presented during this event? As far as I can tell, version 8 is not available on the CV website, and there are no allusions or references to it being upcoming from Mozilla or anywhere else in the event. The releases seem to have a 6 month interval, so I suppose it is about time for the next version :slight_smile:
