[Open-to-the-community] Robust Speech Recognition Challenge

patrickvonplaten · January 12, 2022, 4:07pm

Robust speech recognition in 70+ Languages

Hi all,

We are scaling multi-lingual speech recognition systems - come join us for the robust speech community event from Jan 24th to Feb 7th. With compute provided by OVHcloud, we are going from 50 to 70+ languages, from 300M- to 2B-parameter models, and from toy evaluation datasets to real-world audio evaluation.

What it is about

The goal of the event is to provide robust speech recognition systems in as many languages as possible to the community. We hope that especially low-resource languages will profit from this event.

The main components of the speech recognition event consist of:

Meta AI’s state-of-the-art XLS-R model
Common Voice newest datasets 7 & 8
Language model boosted decoding with Wav2Vec2
Real-world audio for evaluation

How does it work

Participants have two weeks to build as many robust speech recognition systems in as many languages as they want. In general, speech recognition systems can consist of:

Fine-tuned speech recognition checkpoints (e.g. XLS-R)
Language model boosted decoders (e.g. pyctcdecode + n-gram)
Pre- and post-processing modules, such as noise-canceling, spelling correction, …

During the event, you will have the opportunity to work on each of these components to build speech recognition systems in your favorite language!

Each speech recognition system will automatically be evaluated on real-world audio (if available for the language). After the fine-tuning week, the best-performing systems of each language will receive SWAG.

What do I need to do to participate

To participate, simply fill out this short Google form. You will also need to create a Hugging Face Hub account here and join our discord here - when joining the event’s discord channel please make sure to click on the emoji under the first message to access all relevant information. OVHcloud kindly offered to provide a limited number of GPUs for participants if needed - if you would like to have access to a GPU, please join the discord for more information*. Here are some in-depth videos on how to get started with setting up an OVHcloud account.

This fine-tuning week should be especially interesting to native speakers of low-resource languages. Your language skills will help you select the best training data, and possibly build the best existing speech recognition system in your language.

More in-detail information will be announced in the discord channel. We are looking forward to seeing you there!

What do I get

enjoy a bit of Hugging Face vibe
learn how to build state-of-the-art speech recognition systems
free compute to build a powerful fine-tuned model under your name on the Hub
Hugging Face SWAG if you manage to have the best performing model in a language
100 GPU hours from OVHcloud if you manage to have the best performing model in a language

Open-sourcely yours,

Anton, Omar, Nico & Patrick

Modfiededition · January 12, 2022, 5:50pm

Hey, is there any blog/resource available where I can learn how to build an audio dataset for my own native language, so that I can build a speech recognition system during the event for the language I want to work on?

Also very excited for the event

patrickvonplaten · January 12, 2022, 6:15pm

Hey @Modfiededition,

That’s a great question!
As a start, I think it always makes sense to see what datasets are already publicly available that you could use for your language. You could, e.g., browse the Hugging Face Hub here (Hugging Face – The AI community building the future.), select the speech-processing tag and your favorite language tag, and then see which datasets are available through the Hub.

Apart from this you can also check out this github page that lists a lot of publicly available speech datasets: GitHub - jim-schwoebel/voice_datasets: 🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

If you want to extract audio/transcripts yourself, this is much more difficult and you also need to be careful about licensing. YouTube could be a good source, if the licenses allow it

mrm8488 · January 12, 2022, 6:52pm

Great initiative! I am in!

RASMUS · January 12, 2022, 9:03pm

@patrickvonplaten
In Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers, the script that corrects the “kenlm.arpa” file

did not work for me: the spacing used to detect the "0 < s > " line was incorrect.

patrickvonplaten · January 12, 2022, 9:20pm

Hey @RASMUS,

Could you maybe open an issue on transformers or the blog with all the code to reproduce your error? More than happy to help you there then!

RASMUS · January 12, 2022, 9:55pm

Example script to edit kenlm arpa file does not work correctly in kaggle notebook · Issue #15128 · huggingface/transformers · GitHub Opened issue here @patrickvonplaten
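
For readers following along, here is a minimal sketch of a whitespace-tolerant version of such a correction. It assumes the correction in question is the usual one of adding the missing `</s>` unigram to a KenLM ARPA file: increment the `ngram 1=` count and duplicate the `<s>` unigram line with `</s>`. The function name `add_eos_unigram` is hypothetical, and this is not necessarily the fix adopted in the issue.

```python
def add_eos_unigram(lines):
    """Return ARPA lines with a </s> unigram added right after <s>.

    Assumptions (not taken from the thread): the input has a
    "ngram 1=<count>" header line, a unigram line containing the
    token <s>, and no </s> entry yet.
    """
    out = []
    added = False
    for line in lines:
        stripped = line.strip()
        if not added and stripped.startswith("ngram 1="):
            # One more unigram will be added, so bump the header count.
            count = int(stripped.split("=")[1])
            out.append(f"ngram 1={count + 1}\n")
        elif not added and "<s>" in stripped.split():
            # Match on whitespace-separated tokens, so tabs vs. spaces
            # around <s> do not matter.
            out.append(line)
            out.append(line.replace("<s>", "</s>"))
            added = True
        else:
            out.append(line)
    return out


example = [
    "\\data\\\n",
    "ngram 1=3\n",
    "\n",
    "\\1-grams:\n",
    "-1.0\t<unk>\t0\n",
    "0\t<s>\t-0.3\n",
    "-0.5\thello\t0\n",
    "\n",
    "\\end\\\n",
]
corrected = add_eos_unigram(example)
```

To apply it to a real file, read the lines with `readlines()` and write the result back with `writelines()`; the file names are up to you.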

bayartsogt · January 13, 2022, 1:53am

Thank you for the great initiative! I’d love to participate and push Mongolian open-source STT even further!

Each speech recognition system will automatically be evaluated on real-world audio (if available for the language).

Is there a list of the languages that “will” be evaluated on real-world audio?

Also, are there any restrictions on data sources?

cstorm125 · January 13, 2022, 5:34am

@patrickvonplaten super hyped to join! A question about evaluation metrics: some languages, like Thai, need word tokenization, and there are usually many competing tokenization standards. Would it be better to use character-based metrics in this case (CER instead of WER)? Example: airesearch/wav2vec2-large-xlsr-53-th · Hugging Face

patrickvonplaten · January 13, 2022, 8:08am

Great thanks!

patrickvonplaten · January 13, 2022, 8:11am

Hey @bayartsogt,

That’s a great question! We are trying to get real-world audio for as many languages as possible. Currently we have real world audio for ca. 30 languages. We’ll try to find something good for Mongolian as well

Please do not include the Common Voice “test” data split of your preferred language in the training data. Apart from that, there are no restrictions.

patrickvonplaten · January 13, 2022, 8:12am

Hey @cstorm125 ,

Very much agree! For certain tokenized languages we will evaluate on CER instead of WER. The CER metric is already available in datasets: Hugging Face – The AI community building the future.
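
To make the difference concrete, here is a minimal pure-Python sketch of both metrics. It is not the `datasets` implementation: the helper names are hypothetical, spaces are counted as characters, and the space-free ASCII string is only a stand-in for a language written without word boundaries.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# With word boundaries, one wrong letter costs one word out of three.
spaced_wer = wer("the cat sat", "the cat sit")
spaced_cer = cer("the cat sat", "the cat sit")

# Without word boundaries, the same single-letter error makes the whole
# "word" wrong, so WER jumps to 1.0 while CER stays small.
unspaced_wer = wer("thecatsat", "thecatsit")
unspaced_cer = cer("thecatsat", "thecatsit")
```

WER depends on how the text is split into words, whereas CER does not, which is why CER is the more stable choice when several tokenization standards exist.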

patrickvonplaten · January 13, 2022, 8:24am

Please don’t forget to join discord under this link: Hugging Face

RASMUS · January 13, 2022, 8:46am

Is this going to be an individual event, or is there an option to team up like in the Flax event?

patrickvonplaten · January 13, 2022, 3:26pm

We’ll evaluate models individually and also hand out GPU compute for individuals. However, we do encourage participants to build teams on their own if they want to and think it helps improve their models

AigizK · January 14, 2022, 5:29pm

My language (Bashkir) is in Common Voice 7. Can I use this data to train a model?

bayartsogt · January 16, 2022, 6:51am

Thank you so much @patrickvonplaten for answering my questions.
Hope we can finish strong!

polodealvarado · January 19, 2022, 11:41am

Great!
I am quite interested in learning about state-of-the-art speech recognition systems.
I am in!

mpierrau · January 19, 2022, 3:53pm

Hi! Great concept, really looking forward to learning a lot about robust speech systems!

I have one question: In the post above you state that the event consists of using “Common Voice newest datasets 7 & 8” – Will Common Voice 8.0 be revealed/presented during this event? As far as I can tell, version 8 is not available on the CV website, and there are no allusions or references to it being upcoming from Mozilla or anywhere else in the event. The releases seem to have a 6 month interval, so I suppose it is about time for the next version
