PreTrain Wav2Vec2 in Indonesian

Currently, the only pre-trained Wav2Vec2 models that cover Indonesian are multilingual ones. We would therefore like to pre-train Wav2Vec2 on Indonesian datasets only.

Model

A randomly initialized Wav2Vec2 model (the large variant, if possible).
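As a rough sketch of what "randomly initialized" means in practice: building the model from a config object, rather than calling `from_pretrained`, leaves the weights at their random initial values. The snippet below uses the PyTorch classes `Wav2Vec2Config` and `Wav2Vec2ForPreTraining` from `transformers`. The helper names `make_config` and `init_random_model` are made up for illustration, and the large-variant hyperparameters (including the layer-norm settings) are assumptions modeled on the published large checkpoints, not something decided in this proposal.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining


def make_config(large: bool = True) -> Wav2Vec2Config:
    """Hypothetical helper: return a base-sized or large-sized config."""
    if not large:
        # Library defaults correspond to the base-sized architecture.
        return Wav2Vec2Config()
    # Assumed large-sized dimensions; adjust to whatever the team settles on.
    return Wav2Vec2Config(
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        feat_extract_norm="layer",   # assumption: layer norm in the feature encoder
        do_stable_layer_norm=True,   # assumption: pre-layer-norm transformer blocks
    )


def init_random_model(config: Wav2Vec2Config) -> Wav2Vec2ForPreTraining:
    """Hypothetical helper: build the model from a config only.

    No checkpoint is loaded, so all weights are randomly initialized.
    """
    return Wav2Vec2ForPreTraining(config)


config = make_config(large=True)
print(config.hidden_size, config.num_hidden_layers)
```

Calling `init_random_model(make_config())` would then give a model ready to be passed to a pretraining loop. Note that the large variant has several hundred million parameters, so instantiating it needs a fair amount of memory.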

Datasets

In addition to the Indonesian Common Voice (18h), we have also collected the following Indonesian speech datasets:

  • Wavenet Synthetic Voice (>400h)
  • TIML-IDN (14.5h)
  • Bible.is (40h)
  • Podcast (>10,000h)
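To get a feel for the overall scale, here is a quick tally of the figures above. Two of them are stated as lower bounds (>400h and >10,000h), so the total is a lower bound as well. The dictionary keys are just labels chosen for this sketch.

```python
# Hours per dataset as listed above; ">" figures are treated as lower bounds.
hours = {
    "Common Voice (Indonesian)": 18.0,
    "Wavenet Synthetic Voice": 400.0,   # listed as >400h
    "TIML-IDN": 14.5,
    "Bible.is": 40.0,
    "Podcast": 10_000.0,                # listed as >10kh
}

total_hours = sum(hours.values())
podcast_share = hours["Podcast"] / total_hours

print(f"At least {total_hours:,.1f} hours in total")
print(f"Podcast share of the total: {podcast_share:.1%}")
```

The podcast data makes up the vast majority of the audio, so its quality and preprocessing will largely determine what the model sees during pretraining.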

Available training scripts

FlaxWav2Vec2 will be merged soon (see "[Flax] Add wav2vec2" by patrickvonplaten, Pull Request #12271 in huggingface/transformers on GitHub), and a pretraining script should be relatively easy to add.

(Optional) Desired project outcome

The best Indonesian ASR model :)

Team

We have a team from the last wav2vec2 event:

Wuhuuu - super excited about this! I’ll make sure a pretraining script is ready by Thursday.

Finalizing it!