Pretrain Wav2Vec2 in Turkish
During the previous community event I pretrained a Turkish Wav2Vec2 model using the fairseq script. The resulting model is subpar because I hadn't cleaned the data.
This time I want to do it properly with the soon-to-be-merged FlaxWav2Vec2 model and its pretraining script.
Model
A randomly initialized Wav2Vec2 model
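For reference, a minimal sketch of how such a model could be instantiated once the Flax classes land; the class names follow PR #12271, and the default config is just an example starting point rather than the final hyperparameters:

```python
# Minimal sketch: randomly initialize a Flax Wav2Vec2 model for pretraining.
# Assumes the FlaxWav2Vec2 classes from PR #12271 are available in transformers.
from transformers import Wav2Vec2Config, FlaxWav2Vec2ForPreTraining

config = Wav2Vec2Config()  # default (wav2vec2-base-like) hyperparameters, as an example
model = FlaxWav2Vec2ForPreTraining(config, seed=0)  # fresh random weights, no checkpoint
```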
Datasets
One can make use of Common Voice; the dataset is also available through the `datasets`
library here: common_voice · Datasets at Hugging Face.
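Loading the Turkish subset with `datasets` could look like this (the `common_voice` dataset ID and the `tr` config name are as listed on the Hub; the split choice is just an example):

```python
# Load the Turkish configuration of Common Voice via the datasets library.
from datasets import load_dataset

common_voice_tr = load_dataset("common_voice", "tr", split="train+validation")
print(common_voice_tr)  # shows the number of examples and the available columns
```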
Available training scripts
FlaxWav2Vec2 will be merged soon ([Flax] Add wav2vec2 by patrickvonplaten · Pull Request #12271 · huggingface/transformers · GitHub), and a pretraining script should be relatively easy to merge afterwards.
(Optional) Desired project outcome
The best Turkish ASR model.
(Optional) Challenges
I have some additional scraped audiobook data, though I might need a bit more; a rough sketch of combining it with Common Voice follows below.
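A hypothetical sketch of folding the scraped audiobooks into the training set; the `audiofolder` loader and the `path/to/audiobooks` directory are assumptions for illustration, and only the audio column is kept since self-supervised pretraining needs no transcripts:

```python
# Hypothetical sketch: merge scraped audiobook clips with Common Voice Turkish.
# The "audiofolder" loader and the local path are assumptions, not part of the plan.
from datasets import Audio, concatenate_datasets, load_dataset

common_voice_tr = load_dataset("common_voice", "tr", split="train")
audiobooks = load_dataset("audiofolder", data_dir="path/to/audiobooks", split="train")

# Keep only the audio column (pretraining is self-supervised, no transcripts needed)
# and resample everything to the 16 kHz rate wav2vec2 expects.
def audio_only(ds):
    ds = ds.remove_columns([c for c in ds.column_names if c != "audio"])
    return ds.cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([audio_only(common_voice_tr), audio_only(audiobooks)])
```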