Hello everybody! Creating a thread to organize the work on Russian ASR. Datasets found so far:
- Common Voice (111 hours validated)
- CSS10 Russian: Single Speaker Speech Dataset (small, 440 utterances, available on Kaggle as "Russian Single Speaker Speech Dataset")
- Open STT (GitHub: snakers4/open_stt; very large, ~20k hours, multi-domain, probably too large to use in full)
So far, @gorodecki has begun training on a subset of the Common Voice dataset. I briefly tried training on the whole dataset on Colab, but quickly ran out of memory, so I’ll have to revisit that.
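One way to stay within the Colab memory limit might be to cap the training subset by total audio duration. Here is a minimal sketch, assuming the clips are already listed as (path, duration in seconds) pairs; `sample_subset` and its parameters are hypothetical names for illustration, not part of any Common Voice tooling:

```python
import random


def sample_subset(clips, max_hours, seed=0):
    """Pick a random subset of clips whose total duration fits in max_hours.

    clips: iterable of (path, duration_in_seconds) pairs (assumed format).
    Returns the chosen clips and their total duration in seconds.
    """
    budget = max_hours * 3600.0
    shuffled = list(clips)
    # A fixed seed keeps the subset reproducible across Colab sessions.
    random.Random(seed).shuffle(shuffled)

    subset, total = [], 0.0
    for path, seconds in shuffled:
        # Skip any clip that would push the total over the budget.
        if total + seconds > budget:
            continue
        subset.append((path, seconds))
        total += seconds
    return subset, total
```

For example, `sample_subset(clips, max_hours=10)` would return at most 10 hours of audio, and the budget can be raised step by step until memory becomes the bottleneck again.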
Another thought I had, as a possible stretch goal, is to use this Russian model as a base for further fine-tuning on related Slavic languages or dialects that may be lower-resource. I was thinking of Belarusian, Ukrainian, and Kazakh Russian. Any others?