Russian ASR: Fine-tuning Wav2Vec2

arkadyark · March 20, 2021, 7:58am

Hello everybody! Creating a thread to organize the work on Russian ASR

Data:

Common voice (111 hours validated)
CSS10 Russian: Single Speaker Speech Dataset (small, 440 utterances, available at Russian Single Speaker Speech Dataset | Kaggle)
Open STT (GitHub - snakers4/open_stt; very large, ~20k hours, multi-domain, probably too large to use in full)

Progress:
So far, @gorodecki has begun training on a subset of the Common Voice dataset. I tried briefly to run on the whole dataset on Colab, but quickly ran out of memory, so I’ll have to revisit that.

Another thought, possibly as a stretch goal: use this Russian model as a base for further fine-tuning on related Slavic languages/dialects that may be lower-resource. I was thinking of Belarusian, Ukrainian, Kazakh Russian; any others?

Tagging those involved so far (tag anybody else who would like to join us as well!): @gorodecki @vladdy

gorodecki · March 20, 2021, 8:38am

Hello!
My result is 0.38 WER on part of Common Voice.
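
For reference, WER (word error rate) is the word-level edit distance between the reference and the prediction, divided by the number of reference words, so 0.38 is the same as 38%. A minimal sketch in plain Python, assuming whitespace tokenization; `word_error_rate` is a hypothetical helper written for illustration, not the metric implementation used in the training runs here:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example_wer = word_error_rate("мама мыла раму", "мама мыла рану")
```

Note that a corpus-level WER is usually the total number of errors divided by the total number of reference words, which is not the same as averaging per-sentence values.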

anton-l · March 20, 2021, 7:41pm

Hi everyone! Let’s see what we can do in a week

arkadyark · March 21, 2021, 7:56pm

Nice, that’s a great starting point! I think I have access to a GPU machine with more disk space, so I’ll try overnight tonight to see if I can download the whole dataset onto there and see what results I get with the full training set.

How long did training on 10% take for you with Colab?

gorodecki · March 21, 2021, 10:12pm

I’m using a local machine with a T4.
It takes approximately 4 hours on one GPU.
Unfortunately, I could not start training on 4 GPUs in parallel mode.

arkadyark · March 21, 2021, 10:41pm

I see - I have access to a machine with 2 GPUs (P100) but also ran into an error trying to run it with DistributedDataParallel. I’m trying to run it now with just 1 GPU with the full Common Voice dataset.

Did you also get this same error?

“RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).”
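
The error text itself points at find_unused_parameters. A minimal sketch of how that could be passed through the Hugging Face Trainer, assuming a transformers version whose TrainingArguments has the ddp_find_unused_parameters field. The output directory and batch size are placeholder values, and `build_training_args` is a hypothetical helper; this is untested against the exact setup in this thread:

```python
# Keyword arguments intended for transformers.TrainingArguments.
# Only ddp_find_unused_parameters addresses the error above; the other
# values are placeholders chosen for illustration.
ddp_training_kwargs = {
    "output_dir": "./wav2vec2-russian",   # hypothetical path
    "per_device_train_batch_size": 8,     # assumed value
    "ddp_find_unused_parameters": True,   # forwarded to DistributedDataParallel
}


def build_training_args(kwargs):
    """Build TrainingArguments; transformers is imported lazily so the
    dictionary above can be inspected without the library installed."""
    from transformers import TrainingArguments

    return TrainingArguments(**kwargs)
```

Enabling unused-parameter detection adds some overhead per step, so it may be worth switching off again if the error turns out to have another cause.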

gorodecki · March 22, 2021, 12:24pm

No, my error is “CUDA not enough memory”.
I run the training in a Jupyter notebook.
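
One common way to work around GPU memory errors is to lower the per-device batch size and compensate with gradient accumulation, which correspond to the per_device_train_batch_size and gradient_accumulation_steps fields of TrainingArguments. A minimal sketch of the arithmetic, with hypothetical helper names and example numbers that are assumptions rather than settings from this thread:

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_batch_size: int,
                           per_device_batch_size: int,
                           num_gpus: int = 1) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    per_forward_pass = per_device_batch_size * num_gpus
    if target_batch_size % per_forward_pass != 0:
        raise ValueError("target batch size must be a multiple of "
                         "per_device_batch_size * num_gpus")
    return target_batch_size // per_forward_pass


# Example: keep an effective batch of 32 while fitting only 4 examples per GPU.
steps = accumulation_steps_for(target_batch_size=32, per_device_batch_size=4)
```

This trades memory for wall-clock time: each optimizer step needs more forward and backward passes.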

gorodecki · March 22, 2021, 1:18pm

Can you check this parameter in your training_args?

arkadyark · March 22, 2021, 10:57pm

I’ll give that a try! In the meantime - my training run on the whole dataset on a single GPU has just about finished running for 5 epochs, finishing with a WER on the validation set of about 39%. I’d hoped it would do better than your smaller split, but it seems like it didn’t.

The WER does still seem to be decreasing, although slowly, so maybe more training time might help but I’m not sure. This first model is a good starting point to iterate on.

arkadyark · March 23, 2021, 7:03am

Ok, a couple more updates:

When I ran with 2 GPUs and hit the error, training_args.parallel_mode was equal to ParallelMode.DISTRIBUTED, so I’m not sure what’s going wrong there.
I finished training on Common Voice for 5 epochs on a single GPU; my final WER on the evaluation set remained around 0.39. In the end, training took about 19 hours.

As a next step, I may try to load in a portion of the OpenSTT dataset and add this to the training set to see if it improves performance. Any other ideas?

gorodecki · March 23, 2021, 3:43pm

Good work!

@anton-l created this script: anton-l/wav2vec2-large-xlsr-53-russian · Hugging Face.
Make sure you use sentence_clean when you evaluate!

ёжик == ежик == :hedgehog: in Russian
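
The point about ё and е can be handled by normalizing both the reference and the prediction before scoring. A minimal sketch, assuming that only lowercase Cyrillic letters and spaces should survive; `normalize_ru` is a hypothetical helper and is not taken from the linked script:

```python
import re


def normalize_ru(text: str) -> str:
    """Lowercase, fold ё into е, and drop everything except Cyrillic letters."""
    text = text.lower().replace("ё", "е")
    # Replace any character outside а-я and space with a space.
    # Assumption: punctuation, digits, and Latin letters are not scored.
    text = re.sub(r"[^а-я ]", " ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return " ".join(text.split())


normalized = normalize_ru("Ёжик, ёжик!")
```

Replacing punctuation with a space splits hyphenated words such as "кто-то" into two tokens, which changes the word count; whether that is desirable depends on how the reference transcripts were cleaned.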

arkadyark · March 24, 2021, 10:14am

Thanks for sharing, I’ll evaluate on the test set and share my results. @anton-l I see you got a WER of 22.4% - can you share more about the model you trained? Was it trained with the whole common-voice dataset?

arkadyark · March 24, 2021, 10:38am

Running the evaluation script on my model trained on all of Common Voice, I get a WER on the test set of 30.6%.

anton-l · March 26, 2021, 10:25am

I tried augmenting the Ukrainian dataset by phonetically transliterating the Russian one (2+ times larger) using some crude transliteration rules for passport names.
Didn’t help the WER even one bit, but it was fun to try nevertheless.
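
To illustrate the general idea of character-level transliteration, here is a minimal sketch. The mapping is an assumption made up for this example: it covers only a few letters, is linguistically crude, and is not the set of rules actually used in the experiment above. `RU_TO_UK` and `transliterate_ru_to_uk` are hypothetical names:

```python
# Illustrative, incomplete mapping of Russian letters to Ukrainian letters
# with a roughly similar sound. Assumption for this sketch only.
RU_TO_UK = {
    "и": "і",  # Russian и sounds close to Ukrainian і
    "ы": "и",  # Russian ы is roughly Ukrainian и
    "э": "е",  # Russian э is roughly Ukrainian е
    "ъ": "'",  # hard sign becomes an apostrophe
}


def transliterate_ru_to_uk(text: str) -> str:
    """Map each character through RU_TO_UK, leaving unknown characters as-is."""
    return "".join(RU_TO_UK.get(ch, ch) for ch in text.lower())


sample = transliterate_ru_to_uk("Рыба и мир")
```

A per-character table like this ignores context (for example, how е behaves after consonants versus vowels), which is one reason such rules stay crude.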

anton-l · March 26, 2021, 10:32am

Also, Open STT turned out to be too noisy (both in terms of background sounds and annotation) to have any positive effect on Common Voice test metrics, so probably not worth it just yet.

gorodecki · March 26, 2021, 11:29am

I think before training on noisy data, you need to get the WER down to about 10% so that the model’s output probabilities are high.

gorodecki · March 27, 2021, 8:14pm

What metric updates are there? What new experiments have been successful?

My result is 29% WER on CommonVoice.

arkadyark · March 28, 2021, 9:37pm

Hi! I got similar results, and didn’t have a chance to train more models after that - it seems like @anton-l’s model that got 22% was the best one!

anton-l · March 29, 2021, 7:11am

After longer training (60 epochs) and augmentations I was able to squeeze out 17.39 WER! Check out the details on Slack

arkadyark · May 21, 2021, 3:43pm

Just found another new corpus for Russian ASR here; it was reportedly used to get as low as 8% WER on the Common Voice test set. Would be cool to try adding this to the dataset we trained on, and see if we can beat that!

https://github.com/sberdevices/golos
