Indonesian ASR: Fine-Tuning Wav2Vec2

Hi everyone,

I’m planning to fine-tune XLSR-Wav2Vec2 for Indonesian. According to Common Voice, we have around 9 hours of validated voice data and 17 hours of voice data overall for Indonesian. Let’s share tips and collaborate here :smiley:


I am running the fine-tuning in Colab; it should take around 4 hours and 30 minutes. Let’s see tomorrow when it is done :slight_smile:
The WER after 90 minutes is about 0.48. I don’t know what a good WER for Indonesian ASR is.

I’ve finished running the fine-tuning with basic hyperparameters (based on the Turkish fine-tuning notebook) and got a WER of 0.41. I’ll try to tweak them further this weekend.
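In case anyone wants to sanity-check the WER numbers without extra dependencies, here is a minimal pure-Python word error rate (the example sentence is made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("saya punya dua rumah", "saya punya rumah"))  # 0.25
```

The notebooks typically use the `wer` metric from the `datasets`/`jiwer` libraries; this sketch is just for quick checks on single sentences.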

I’ve only taken a look at the abstract, so I don’t know the details of the dataset and methods they used yet, but one study, ANALISIS AKURASI SISTEM PENGENALAN SUARA PADA KALIMAT BAHASA INDONESIA (Analysis of the Accuracy of a Speech Recognition System on Indonesian Sentences), obtained a WER below 0.10.

Meanwhile, I get a WER of 0.419 after 3600 steps. Thanks for the link to the paper; unfortunately I can only read the abstract, title, table of contents, and bibliography, but not the content itself. In any case, we can’t compare WERs without knowing which dataset they used.

I also finished the fine-tuning notebook based on the Turkish one and got a 0.410 WER after 3600 steps. By the way, can we add extra data to the training set that we’ve augmented ourselves?

I ask because I downloaded the test data from Common Voice and found the audio quite noisy. Also, only about 6–7k clips have transcripts, while around 8k clips are still untranscribed out of roughly 15k total.
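If you want to count the transcribed vs. untranscribed clips yourself, you can read the release TSVs directly. A minimal sketch (the `path`/`sentence` column names follow the Common Voice TSV format; the demo uses an in-memory string instead of a real file):

```python
import csv
import io

def load_clips(tsv_file):
    """Yield (audio_path, sentence) pairs from a Common Voice TSV,
    skipping rows with an empty transcript."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if row.get("sentence", "").strip():
            yield row["path"], row["sentence"].strip()

# Demo with an in-memory TSV (in practice: validated.tsv, other.tsv, etc.).
sample = "path\tsentence\nclip1.mp3\tsaya punya dua rumah\nclip2.mp3\t\n"
print(list(load_clips(io.StringIO(sample))))  # [('clip1.mp3', 'saya punya dua rumah')]
```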

Maybe you mean after 3600 steps, which is almost 30 epochs :slight_smile:
I read somewhere that we can use any additional dataset for training as long as we don’t use the test set
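For merging extra clips into the training set safely, something like this could work. A rough sketch (the function name and the tuple format are my own assumption, not from any library):

```python
def merge_train_data(base, extra, test):
    """Combine base training clips with extra clips, dropping any clip
    whose transcript appears in the test set (to avoid leakage) or whose
    audio path is already present. Each argument is a list of
    (audio_path, transcript) tuples."""
    test_sentences = {text for _, text in test}
    merged = list(base)
    seen_paths = {path for path, _ in base}
    for path, text in extra:
        if text in test_sentences or path in seen_paths:
            continue
        merged.append((path, text))
        seen_paths.add(path)
    return merged
```

Filtering on exact transcript matches is crude but catches the common case where an "extra" corpus overlaps with Common Voice sentences.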

Thanks for the correction there :smiley:
Okay, thank you for the information

Ok, here is the discussion about the additional dataset


Mind if I add your names, @Galuh and @cahya, to the spreadsheet list of xlsr-fine-tuning-contrinbutors?

Feel free to add them. Where is the list, btw?

@cahya The list is here: xlsr-fine-tuning-contrinbuters - Google Sheets. I’ve added your name and mine.

By the way, it’s confirmed that we can use additional datasets for training: transformers/ at master · huggingface/transformers · GitHub

Perhaps we can use these additional datasets?


It would be great if we could get additional datasets

I’m currently working on collecting closed captions from YouTube. I’ll post the dataset when it’s done :smile:


Is it legal to collect a dataset from YouTube? If it is, I have several WAV files plus transcripts of podcasts I gathered from YouTube.

Does anyone know how to compare STT results against a language corpus?
E.g., the STT output is “saya punya dua rumah”, while the corpus has “saya punya 2 rumah”.
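One approach I’ve seen is to normalize both strings before scoring, e.g. mapping digits to their Indonesian number words. A rough sketch (the digit table is illustrative, not complete):

```python
# Map single digits to Indonesian number words so "2" and "dua" compare equal.
# This table is illustrative, not exhaustive (no multi-digit numbers, ordinals, etc.).
ID_NUMBERS = {
    "0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
    "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan", "9": "sembilan",
}

def normalize(text: str) -> str:
    """Lowercase the text and spell out single digits before comparison."""
    return " ".join(ID_NUMBERS.get(w, w) for w in text.lower().split())

print(normalize("saya punya 2 rumah"))  # saya punya dua rumah
```

After normalizing both the STT output and the corpus sentence, a plain WER computation treats the two spellings as identical.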

We can filter by license. But I’ve seen some datasets built from YouTube, like GitHub - snakers4/open_stt: Open STT

Okay, thanks, I found out how to filter by license


I am trying to get an ASR dataset from BPPT (Data Wicara), mentioned in this journal paper: View of Uji Coba Korpus Data Wicara BPPT sebagai Data Latih Sistem Pengenalan Wicara Bahasa Indonesia (Trial of the BPPT Data Wicara Speech Corpus as Training Data for an Indonesian Speech Recognition System). The dataset has about 92 hours of audio. It is not available to the public at the moment, but it can be requested officially by an institution. Let’s see if I can get it as an individual :slight_smile:

Hopefully they’re willing to share their dataset :smile:

Btw, how about combining our efforts and putting the models under an organisation name instead of our individual names? I have already set up an HF organisation: indonesian-nlp (Indonesian NLP). Please feel free to join if you like.