I’m fine-tuning XLSR-Wav2Vec2 on 200+ hours of speech in a language that was not in the original pretraining data.
The training progresses nicely; however, when it reaches a WER of about 40 it starts to overfit (WER barely improves, and train loss keeps decreasing while eval loss goes up).
I’ve tried increasing some of the SpecAugment parameters, but it only helped a bit.
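For context on what those parameters control: in Transformers the time-masking part of SpecAugment is set through the `mask_time_prob` and `mask_time_length` fields of `Wav2Vec2Config`. The snippet below is only a rough, simplified simulation of span masking (it is not the library's code, and the helper names `simulate_time_mask` and `masked_fraction` are made up for illustration). It assumes the number of spans is roughly `mask_time_prob * seq_len / mask_time_length`, and that spans may overlap, so the masked fraction is at most `mask_time_prob`.

```python
import random


def simulate_time_mask(seq_len, mask_time_prob, mask_time_length, seed=0):
    """Return a list of booleans marking which frames would be masked.

    Simplified sketch: span starts are drawn independently, so spans can
    overlap and the real masked fraction is at most mask_time_prob.
    """
    rng = random.Random(seed)
    num_spans = int(mask_time_prob * seq_len / mask_time_length)
    mask = [False] * seq_len
    for _ in range(num_spans):
        start = rng.randrange(0, seq_len - mask_time_length + 1)
        for t in range(start, start + mask_time_length):
            mask[t] = True
    return mask


def masked_fraction(mask):
    """Fraction of frames that ended up masked."""
    return sum(mask) / len(mask)


# Placeholder values, only to show how the masked fraction grows with the probability.
for prob in (0.05, 0.1, 0.2):
    mask = simulate_time_mask(seq_len=500, mask_time_prob=prob, mask_time_length=10)
    print(prob, round(masked_fraction(mask), 3))
```

Because overlapping spans cover fewer frames, raising `mask_time_prob` gives diminishing returns in the fraction actually masked, which may be part of why larger values only help a little.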
I’ve noticed that with the SpeechBrain implementation I’m getting slightly better results (at the expense of training stability), and I was wondering whether that is due to the larger vocabulary they use there. Has anyone tried using a tokenizer with a vocabulary that contains subwords and words in addition to characters? I couldn’t find any experiment that uses one with Huggingface Transformers W2V2.
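To make the idea concrete, here is a minimal sketch of such a mixed vocabulary, built in plain Python from the training transcripts: all characters, plus the most frequent character n-grams and whole words. Everything here is an assumption for illustration: the helper names `build_mixed_vocab` and `segment` are hypothetical, the special tokens `<pad>`, `<unk>` and `|` are just a common convention, and the greedy longest-match segmentation is one possible choice. I have not checked whether the Transformers CTC tokenizer would split text into multi-character tokens like this on its own.

```python
from collections import Counter


def build_mixed_vocab(transcripts, num_words=1, num_subwords=1, subword_len=2):
    """Build a token-to-id map with characters, frequent n-grams and frequent words."""
    chars, words, subwords = Counter(), Counter(), Counter()
    for text in transcripts:
        for word in text.split():
            words[word] += 1
            chars.update(word)
            # Count every contiguous character n-gram inside the word.
            for i in range(len(word) - subword_len + 1):
                subwords[word[i:i + subword_len]] += 1

    tokens = ["<pad>", "<unk>", "|"]  # assumed special tokens
    tokens += sorted(chars)
    tokens += [s for s, _ in subwords.most_common(num_subwords) if s not in tokens]
    tokens += [w for w, _ in words.most_common(num_words) if w not in tokens]
    return {tok: idx for idx, tok in enumerate(tokens)}


def segment(word, vocab):
    """Greedy longest-match split of a word into vocabulary tokens."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No token covers this character, so fall back to the unknown token.
            pieces.append("<unk>")
            i += 1
    return pieces


transcripts = ["the cat sat", "the hat"]  # toy data, not from my dataset
vocab = build_mixed_vocab(transcripts)
print(segment("the", vocab), segment("cat", vocab))
```

With a vocabulary like this the CTC targets get shorter, since frequent words and n-grams become single labels, at the cost of a larger output layer.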
I see that the Wav2Vec 2.0 paper says:
"We expect performance gains by switching to a seq2seq architecture and a word piece vocabulary."
Any suggestions on how to do that with Huggingface Transformers?
P.S. My dataset is noisy and not super clean.
Any help or suggestions would be much appreciated.