Pretrain Wav2Vec2 in Turkish
During the previous community event I pretrained a Turkish Wav2Vec2 model using the fairseq script. The resulting model is subpar because I hadn't cleaned the data.
This time I want to do it properly with the soon-to-be-merged FlaxWav2Vec2 model and its pretraining script.
Model
A randomly initialized Wav2Vec2 model
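For reference, a minimal sketch of how such a model could be instantiated once the Flax classes land; the class names follow PR #12271, and the default config is just an example starting point rather than the final hyperparameters:

```python
# Minimal sketch: randomly initialize a Flax Wav2Vec2 model for pretraining.
# Assumes the FlaxWav2Vec2 classes from PR #12271 are available in transformers.
from transformers import Wav2Vec2Config, FlaxWav2Vec2ForPreTraining

config = Wav2Vec2Config()  # default (wav2vec2-base-like) hyperparameters, as an example
model = FlaxWav2Vec2ForPreTraining(config, seed=0)  # fresh random weights, no checkpoint
```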
Datasets
One can make use of Common Voice; the dataset is also available through the `datasets`
library here: common_voice · Datasets at Hugging Face.
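Loading the Turkish subset with `datasets` could look like this (the `common_voice` dataset ID and the `tr` config name are as listed on the Hub; the split choice is just an example):

```python
# Load the Turkish configuration of Common Voice via the datasets library.
from datasets import load_dataset

common_voice_tr = load_dataset("common_voice", "tr", split="train+validation")
print(common_voice_tr)  # shows the number of examples and the available columns
```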
Available training scripts
FlaxWav2Vec2 will be merged soon ([Flax] Add wav2vec2 by patrickvonplaten · Pull Request #12271 · huggingface/transformers · GitHub), and a pretraining script should be relatively easy to merge afterwards.
(Optional) Desired project outcome
The best Turkish ASR model.
(Optional) Challenges
I have some additional scraped audiobook data, though I might need a bit more; a rough sketch of combining it with Common Voice follows below.
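A hypothetical sketch of folding the scraped audiobooks into the training set; the `audiofolder` loader and the `path/to/audiobooks` directory are assumptions for illustration, and only the audio column is kept since self-supervised pretraining needs no transcripts:

```python
# Hypothetical sketch: merge scraped audiobook clips with Common Voice Turkish.
# The "audiofolder" loader and the local path are assumptions, not part of the plan.
from datasets import Audio, concatenate_datasets, load_dataset

common_voice_tr = load_dataset("common_voice", "tr", split="train")
audiobooks = load_dataset("audiofolder", data_dir="path/to/audiobooks", split="train")

# Keep only the audio column (pretraining is self-supervised, no transcripts needed)
# and resample everything to the 16 kHz rate wav2vec2 expects.
def audio_only(ds):
    ds = ds.remove_columns([c for c in ds.column_names if c != "audio"])
    return ds.cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([audio_only(common_voice_tr), audio_only(audiobooks)])
```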