PreTrain Wav2Vec2 in Indonesian

cahya · June 29, 2021, 9:31am

Currently, the only pre-trained Wav2Vec2 model that covers Indonesian is a multilingual one. We would therefore like to pre-train Wav2Vec2 on Indonesian datasets only.

Model

A randomly initialized Wav2Vec2 model (if possible the large model)
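As a rough sketch of what "randomly initialized" could look like in practice: building the model from a config object, rather than loading a checkpoint, leaves all weights at their random initial values. The hyperparameters below are an assumption that mirrors the commonly published large configuration, and `make_large_config` and `build_random_model` are hypothetical helper names. The `FlaxWav2Vec2ForPreTraining` class is the one shipped in later transformers releases and needs Flax installed.

```python
from transformers import Wav2Vec2Config


def make_large_config() -> Wav2Vec2Config:
    """Config with 'large'-sized hyperparameters (assumed values, not from the post)."""
    return Wav2Vec2Config(
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        feat_extract_norm="layer",
        do_stable_layer_norm=True,
    )


def build_random_model(config: Wav2Vec2Config, seed: int = 0):
    """Instantiate the model from the config only, so no pretrained weights are loaded."""
    # Imported lazily so the config can be inspected without Flax installed.
    from transformers import FlaxWav2Vec2ForPreTraining

    return FlaxWav2Vec2ForPreTraining(config, seed=seed)


config = make_large_config()

if __name__ == "__main__":
    # Allocates a few hundred million parameters; run only on a suitable machine.
    model = build_random_model(config)
```

Dropping the explicit sizes and using `Wav2Vec2Config()` with its defaults would give the smaller base-sized model instead, which may be the fallback if the large model does not fit.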

Datasets

In addition to the Indonesian Common Voice (18h), we have also collected the following Indonesian speech datasets:

Wavenet Synthetic Voice (>400h)
TIML-IDN (14.5h)
Bible.is (40h)
Podcast (>10kh)
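To get a feel for how the data is distributed, the hours listed above can be tallied with a short script. This is only a sketch: the ">" figures are treated as lower bounds, and ">10kh" is read here as more than 10,000 hours, which is an assumption about the notation.

```python
# Hours per dataset as listed in the post; ">" entries are used as lower bounds.
hours = {
    "Common Voice (Indonesian)": 18.0,
    "Wavenet Synthetic Voice": 400.0,  # listed as ">400h"
    "TIML-IDN": 14.5,
    "Bible.is": 40.0,
    "Podcast": 10_000.0,  # listed as ">10kh", assumed to mean >10,000 hours
}

total_hours = sum(hours.values())
largest = max(hours, key=hours.get)

# Print each dataset with its share of the (lower-bound) total.
for name, h in sorted(hours.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {h:>9.1f} h  {h / total_hours:6.1%}")
print(f"{'Total (lower bound)':28s} {total_hours:>9.1f} h")
```

Under these assumptions the podcast data makes up the vast majority of the audio, so how it is filtered and sampled would likely matter more than any of the smaller corpora.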

Available training scripts

FlaxWav2Vec2 will be merged soon ([Flax] Add wav2vec2 by patrickvonplaten · Pull Request #12271 · huggingface/transformers · GitHub), and a pretraining script should be relatively easy to merge after that.

(Optional) Desired project outcome

The best Indonesian ASR model

Team

We have a team from the last wav2vec2 event:

patrickvonplaten · June 29, 2021, 2:00pm

Wuhuuu - super excited about this! I’ll make sure a pretraining script is ready by Thursday.

Finalizing it!
