Pre-train Wav2Vec2 in Indonesian
At present, the only pre-trained Wav2Vec2 models that cover Indonesian are multilingual ones. We would therefore like to pre-train Wav2Vec2 using only Indonesian datasets.
Model
A randomly initialized Wav2Vec2 model (if possible the large model)
Datasets
In addition to the Indonesian Common Voice (18h), we have also collected the following Indonesian speech datasets:
- Wavenet Synthetic Voice (>400h)
- TIML-IDN (14.5h)
- Bible.is (40h)
- Podcast (>10,000h)
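To get a feel for the scale of the combined corpus, the figures above can be added up. The snippet below is only a rough sketch: it treats the ">" entries as lower bounds, reads ">10kh" as more than 10,000 hours, and the names `DATASET_HOURS`, `total_hours` and `share_by_dataset` are made up for illustration.

```python
# Hours per dataset as listed above. The boolean marks entries that the
# list gives as a lower bound (">"), so the total is also a lower bound.
DATASET_HOURS = {
    "Common Voice (Indonesian)": (18.0, False),
    "Wavenet Synthetic Voice": (400.0, True),
    "TIML-IDN": (14.5, False),
    "Bible.is": (40.0, False),
    "Podcast": (10_000.0, True),  # assumes ">10kh" means >10,000 hours
}


def total_hours(datasets):
    """Sum the listed hours across all datasets."""
    return sum(hours for hours, _ in datasets.values())


def share_by_dataset(datasets):
    """Return each dataset's fraction of the summed hours."""
    total = total_hours(datasets)
    return {name: hours / total for name, (hours, _) in datasets.items()}


if __name__ == "__main__":
    print(f"At least {total_hours(DATASET_HOURS):,.1f} hours in total")
    for name, share in share_by_dataset(DATASET_HOURS).items():
        print(f"{name}: {share:.1%}")
```

With these numbers the podcast audio makes up the vast majority of the hours, so how it is sampled relative to the smaller, cleaner sets is probably worth deciding early.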
Available training scripts
FlaxWav2Vec2 will be merged soon (see "[Flax] Add wav2vec2" by patrickvonplaten, Pull Request #12271 in huggingface/transformers on GitHub), and a pre-training script should be relatively easy to merge after that.
(Optional) Desired project outcome
The best Indonesian ASR model
Team
We have a team from the last wav2vec2 event: