Russian ASR: Fine-tuning Wav2Vec2

Hello everybody! I'm creating this thread to organize the work on Russian ASR.


So far, @gorodecki has begun training on a subset of the Common Voice dataset. I briefly tried running on the whole dataset on Colab, but quickly ran out of memory, so I’ll have to revisit that.

Another thought, which could be a stretch goal: use this Russian model as a base for further fine-tuning on similar Slavic languages/dialects that may be lower resource. I was thinking of Belarusian, Ukrainian, Kazakh Russian. Any others?

Tagging those involved so far (tag anybody else who would like to join us as well!): @gorodecki @vladdy


My result is 0.38 WER on part of Common Voice.
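For anyone comparing numbers in this thread, here is a minimal sketch of how word error rate is computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is made up for illustration, and this is not the metric implementation used in the runs above. Note that a corpus-level WER is usually obtained by summing errors and reference words over all utterances rather than averaging per-sentence scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
example_wer = word_error_rate("a b c d", "a x c")
```

A value of 0.38 therefore means roughly 38 word errors per 100 reference words.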

Hi everyone! Let’s see what we can do in a week :rocket:


Nice, that’s a great starting point! I think I have access to a GPU machine with more disk space, so I’ll try overnight tonight to see if I can download the whole dataset onto there and see what results I get with the full training set.

How long did training on 10% take for you with Colab?

I’m using a local machine with a T4.
It takes approximately 4 hours on one GPU.
Unfortunately, I couldn’t get training to start on 4 GPUs in parallel mode.

I see. I have access to a machine with 2 GPUs (P100), but I also ran into an error trying to run it with DistributedDataParallel. I’m now trying to run it on just 1 GPU with the full Common Voice dataset.

Did you also get this same error?

“RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).”
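The error message itself suggests passing `find_unused_parameters=True` to `DistributedDataParallel`. When training through the Hugging Face `Trainer`, the corresponding `TrainingArguments` field is `ddp_find_unused_parameters`. Below is a rough sketch of setting it only for multi-GPU runs. The helper `build_training_kwargs`, the output path, and the batch size are made up for illustration, and whether this resolves the error in this particular setup has not been verified here.

```python
def build_training_kwargs(output_dir: str, multi_gpu: bool) -> dict:
    """Collect keyword arguments intended for transformers.TrainingArguments."""
    kwargs = {
        "output_dir": output_dir,
        "per_device_train_batch_size": 8,  # hypothetical value
        "num_train_epochs": 5,             # the runs in this thread used 5 epochs
    }
    if multi_gpu:
        # Lets DDP tolerate parameters that do not contribute to the loss
        # in a given step, which is what the RuntimeError complains about.
        kwargs["ddp_find_unused_parameters"] = True
    return kwargs


kwargs = build_training_kwargs("./wav2vec2-ru", multi_gpu=True)

# Assumed usage, with transformers installed:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**kwargs)
```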

No, my error is “CUDA out of memory”.
I run training from a Jupyter notebook.

Can you check this parameter in your training_args?
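On the out-of-memory error: one common workaround is to lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` (both are `TrainingArguments` fields) so that the effective batch size stays the same. The arithmetic is sketched below. The helper name and the example numbers are assumptions for illustration, not settings anyone in this thread reported using.

```python
def accumulation_steps(target_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to reach at least target_batch samples per update."""
    per_step = per_device_batch * num_gpus
    if per_step <= 0:
        raise ValueError("per_device_batch and num_gpus must be positive")
    # Ceiling division, never going below one step.
    return max(1, -(-target_batch // per_step))


# Hypothetical case: an effective batch of 32 with only 4 samples fitting on one GPU.
steps = accumulation_steps(target_batch=32, per_device_batch=4)
```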

I’ll give that a try! In the meantime - my training run on the whole dataset on a single GPU has just about finished running for 5 epochs, finishing with a WER on the validation set of about 39%. I’d hoped it would do better than your smaller split, but it seems like it didn’t.

The WER does still seem to be decreasing, although slowly, so maybe more training time might help but I’m not sure. This first model is a good starting point to iterate on.


Ok, a couple more updates:

  • When I ran with 2 GPUs and hit the error, training_args.parallel_mode was equal to ParallelMode.DISTRIBUTED, so I’m not sure what’s going wrong there.
  • I finished training on common_voice for 5 epochs on a single GPU. My final WER on the evaluation set remained around 0.39, and training took about 19 hours in total.

As a next step, I may try to load in a portion of the OpenSTT dataset and add this to the training set to see if it improves performance. Any other ideas?

Good work!

@anton-l created this script: anton-l/wav2vec2-large-xlsr-53-russian · Hugging Face.
You can try sentence_clean for evaluation!

ёжик == ежик == :hedgehog: in Russian :blush:
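To make the ёжик/ежик point concrete, here is a small sketch of text normalization applied to both references and predictions before computing WER. This is not the cleaning code from @anton-l's script; the function name `normalize_for_wer` and the exact rules (lowercase, fold ё into е, drop everything except Russian letters and whitespace) are assumptions for illustration.

```python
import re


def normalize_for_wer(text: str) -> str:
    """Lowercase, fold 'ё' into 'е', and strip non-Russian characters."""
    text = text.lower().replace("ё", "е")
    # After folding, the range а-я covers every lowercase Russian letter.
    text = re.sub(r"[^а-я\s]", " ", text)
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(text.split())


same = normalize_for_wer("Ёжик!") == normalize_for_wer("ежик")
```

With these rules, hyphenated words such as "кто-то" are split into two words, which may or may not be what you want for scoring.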

Thanks for sharing, I’ll evaluate on the test set and share my results. @anton-l I see you got a WER of 22.4%. Can you share more about the model you trained? Was it trained on the whole Common Voice dataset?

Running the evaluation script on my model trained on all of Common Voice, I get a WER of 30.6% on the test set.

I tried augmenting the Ukrainian dataset by phonetically transliterating the Russian one (more than twice as large), using some crude transliteration rules meant for passport names.
Didn’t help the WER even one bit, but it was fun to try nevertheless :joy:
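For readers curious what such a transliteration could look like, here is a deliberately crude character-level sketch. The mapping below is an illustrative assumption based on rough sound correspondences (for example, Russian "ы" written as Ukrainian "и"); it is not the set of rules used in the experiment above, and it ignores context, so many outputs will not be valid Ukrainian spelling.

```python
# Hypothetical, context-free Russian-to-Ukrainian letter mapping.
RU_TO_UK = {
    "и": "і",
    "ы": "и",
    "э": "е",
    "е": "є",
    "ё": "йо",
    "ъ": "'",
}


def crude_ru_to_uk(text: str) -> str:
    """Rewrite Russian text with Ukrainian letters, one character at a time."""
    out = []
    for ch in text:
        repl = RU_TO_UK.get(ch.lower())
        if repl is None:
            out.append(ch)                 # letter shared by both alphabets, or non-letter
        elif ch.isupper():
            out.append(repl.capitalize())  # keep capitalization of the source letter
        else:
            out.append(repl)
    return "".join(out)


sample = crude_ru_to_uk("это сыр")
```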

Also, Open STT turned out to be too noisy (both in terms of background sounds and annotation) to have any positive effect on Common Voice test metrics, so it’s probably not worth it just yet.


I think that before training on noisy data, you need to reach a WER of about 10% so that the model’s output probabilities are high enough.

:slightly_smiling_face: What metric updates are there? What new experiments have been successful?

My result is 29% WER on Common Voice.

Hi! I got similar results and didn’t have a chance to train more models after that. It seems like @anton-l’s model that got 22% was the best one!

After longer training (60 epochs) and augmentations I was able to squeeze out 17.39% WER! :scream_cat: Check out the details on Slack


Just found another new corpus for Russian ASR here, with reported results as low as 8% WER on the Common Voice test set. It would be cool to add it to the dataset we trained on and see if we can beat that!
