Hi everyone,
I have recently been trying to fine-tune the “facebook/wav2vec2-xls-r-300m” model on a Turkish dataset. I already have an ASR model that we built with Kaldi, which has a WER of around 10%. I expected to reach a better WER after adding about 100 hours of our own data. However, each time I add more data, the WER gets worse. The first run gave a WER of 20% and the second gave 37%. Finally, I started a training run with about 40 hours of additional data; it started at a WER of 68% and dropped to 57% after 3 epochs. I expect some further decrease by epoch 30, but in my previous training runs the WER only dropped by around 20 to 30% from its initial value.
When I look at other people's pretrained ASR models on HF, they all have much more acceptable WERs, below 20%.
So I started to think that I might be doing something wrong.
My guess is that the sampling rate is the problem: the dataset I am using for fine-tuning is 8 kHz, whereas almost all pretrained XLSR ASR models are 16 kHz.
During fine-tuning I upsample my data to 16 kHz.
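For reference, here is a minimal sketch of what that kind of upsampling step can look like. It is only an illustration, not necessarily the exact code in my pipeline: it assumes the audio is already loaded as a 1-D NumPy array, and the function name `upsample_8k_to_16k` is hypothetical. It uses `scipy.signal.resample_poly` for polyphase resampling.

```python
import numpy as np
from scipy.signal import resample_poly


def upsample_8k_to_16k(waveform: np.ndarray) -> np.ndarray:
    """Upsample a mono 8 kHz waveform to 16 kHz (hypothetical helper).

    Assumes `waveform` is a 1-D array of samples recorded at 8 kHz.
    """
    # 16000 / 8000 = 2, so insert one new sample per original sample
    # (up=2, down=1); resample_poly applies a low-pass filter internally.
    upsampled = resample_poly(waveform, up=2, down=1)
    # Cast to float32, the dtype commonly fed to the feature extractor.
    return upsampled.astype(np.float32)


# One second of a 440 Hz tone sampled at 8 kHz, as stand-in data.
t = np.arange(8000) / 8000.0
audio_8k = np.sin(2 * np.pi * 440.0 * t)
audio_16k = upsample_8k_to_16k(audio_8k)
print(len(audio_8k), len(audio_16k))
```

Note that resampling like this only changes the sample rate. Frequencies above 4 kHz, which an 8 kHz recording never captured, stay empty in the 16 kHz output.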
What I want to know is:
1- Is it OK to fine-tune a 16 kHz base model with 8 kHz data that has been upsampled to 16 kHz?
2- Should I use original 16 kHz data to fine-tune a 16 kHz base model?
3- Is it possible to convert a 16 kHz base model into an 8 kHz model without losing performance (that is, with the same WER)?
I would appreciate any guidance on this issue.
Thanks in advance.