Japanese ASR: Fine-Tuning Wav2Vec2

Has anyone had any luck training a Wav2Vec2 model on Japanese (or any language with a very large character set, like Chinese)?

I was interested to see how a naive approach would perform, so I tried the standard training script on a fairly powerful box with the Common Voice Japanese dataset, and I always hit OOM errors at the first eval stage. The dataset uses Kanji in the labelled text, so the resulting vocab is pretty big, which I suspect is the problem. It might also make the resulting model perform badly, given the different pronunciations a single Kanji can carry in different contexts.
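To illustrate the vocab problem: the standard character-level CTC recipe builds one output unit per distinct character in the training transcripts. A rough sketch (the sentences below are made-up stand-ins, not actual Common Voice labels):

```python
# Hypothetical mini-corpus standing in for Common Voice Japanese labels.
sentences = ["今日は良い天気です", "猫が好きです", "東京に住んでいます"]

# Character-level vocab, as the standard Wav2Vec2 fine-tuning script builds it.
vocab = sorted(set("".join(sentences)))

# Even three short sentences yield 19 distinct characters; on a full corpus,
# each distinct Kanji adds another output class to the CTC head, so the
# vocab easily reaches thousands of classes (vs. ~30 for English letters).
print(len(vocab))  # → 19
```

A bigger vocab means a proportionally bigger CTC projection layer and larger logit tensors at eval time, which is one plausible route to the OOM.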

For those not familiar with Japanese: it uses a mixture of two phonetic alphabets, Hiragana and Katakana (the latter mostly used to phonetically represent foreign loanwords), where each character has a fixed sound, plus Kanji, which can have different pronunciations depending on context (here is a nice example).

There are 46 base characters each in Hiragana and Katakana (diacritic marks extend these to 71 distinct sounds), while there are tens of thousands of different Kanji.

I was thinking of first mapping the texts to the Hiragana and Katakana alphabets and training the model to produce transcripts using only the phonetic characters, then perhaps using a language model to convert them back to Kanji. I'd be interested to hear if anyone else has tried this approach.
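One half of that mapping needs no dictionary at all: Katakana can be folded into Hiragana by a fixed Unicode offset (the blocks U+30A1–U+30F6 and U+3041–U+3096 are parallel), which keeps the phonetic vocab down to roughly 80 symbols. A minimal stdlib sketch (Kanji-to-kana itself needs a dictionary-backed library and is not shown here):

```python
# Katakana block U+30A1..U+30F6 maps onto Hiragana U+3041..U+3096 at a
# constant offset of 0x60; build a translation table from that fact.
KATA_TO_HIRA = {code: code - 0x60 for code in range(0x30A1, 0x30F7)}

def fold_kana(text: str) -> str:
    """Map every Katakana character to its Hiragana counterpart;
    characters outside the block (Kanji, punctuation, the long-vowel
    mark ー at U+30FC) pass through unchanged."""
    return text.translate(KATA_TO_HIRA)

print(fold_kana("コーヒー"))  # → こーひー (ー is kept as-is)
```

Folding the two kana scripts together before building the vocab also means the model never has to learn that カ and か are the same sound.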

I see there are a few libraries out there for handling the text conversion.

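As a concrete sketch of what the Kanji-to-kana step might look like, here is one such library, pykakasi (my assumption; the exact return keys are from its 2.x API and worth double-checking), guarded so the snippet still runs if it isn't installed:

```python
# Sketch using pykakasi (`pip install pykakasi`) to turn a Kanji-bearing
# transcript into pure Hiragana before building the CTC vocab.
try:
    import pykakasi
except ImportError:  # library not installed; the sketch is illustrative only
    pykakasi = None

if pykakasi is not None:
    kks = pykakasi.kakasi()

    def to_hiragana(text: str) -> str:
        # convert() segments the input and yields per-token readings;
        # "hira" is the Hiragana reading of each token (assumed key name).
        return "".join(tok["hira"] for tok in kks.convert(text))

    # Hypothetical example sentence, not an actual Common Voice label.
    print(to_hiragana("今日は良い天気です"))
```

Since pykakasi picks one reading per token from its dictionary without acoustic context, some Kanji will get the wrong reading, which is noise the downstream language-model step would have to tolerate.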
I'm training some other language models at the moment, but I'll give this approach a go and update here if I can get it working.