Swedish ASR: Fine-Tuning Wav2Vec2

Hey everyone. I trained the model on Swedish (just with the default parameters) and I'm curious whether we could figure out a good way to fine-tune the hyperparameters.
My WER after 4000 steps was 0.511916 on a dataset of 402 MB.
I created a spreadsheet; if people could fill in the parameters they trained with, maybe we could figure out better settings for training. :heart:
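For anyone comparing numbers in the sheet, WER is just the word-level edit distance divided by the number of reference words. A minimal pure-Python sketch (the Swedish sentences are made-up examples, not from the dataset):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

print(wer("jag heter anna", "jag heter ana"))  # 1 substitution / 3 words ≈ 0.333
```

In practice the notebook uses the `wer` metric from the `datasets` library, which computes the same quantity.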

Here is a link to my Google Colaboratory.

I ran the same training again tonight (didn't change any parameters, but did filter out apostrophes) and got a WER of 0.514714.
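Filtering out the apostrophe (along with other punctuation) can be done with a small regex in the transcript-cleaning step. A sketch, assuming a `sentence` column as in the Common Voice notebooks; the exact character set here is my assumption, not the one I actually used:

```python
import re

# Hypothetical character set to strip; adjust to taste
chars_to_remove_regex = r"""[,?.!\-;:"'’]"""

def remove_special_characters(batch):
    # Strip punctuation/apostrophes and lowercase the transcript
    batch["sentence"] = re.sub(chars_to_remove_regex, "", batch["sentence"]).lower()
    return batch

print(remove_special_characters({"sentence": "Hej! Vad heter du?"})["sentence"])  # hej vad heter du
```

This would typically be applied with `dataset.map(remove_special_characters)`.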

Out of curiosity I took the inference part from the notebook and looped over the test set, printing the predicted text together with the original text.

import torch

model.eval()  # disable dropout so predictions are deterministic
with torch.no_grad():
  for i in range(len(common_voice_test["input_values"])):
    # pass sampling_rate explicitly to avoid the "sampling_rate not provided" warning
    input_dict = processor(common_voice_test["input_values"][i], sampling_rate=16_000,
                           return_tensors="pt", padding=True)

    logits = model(input_dict.input_values.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)[0]
    print(str(i) + "\t" + processor.decode(pred_ids) + "\t"
          + common_voice_test_transcription["sentence"][i].lower())

I got 76 lines (out of 2027) before Colaboratory disconnected me, and each line came with a warning about sampling_rate not being provided. I tried setting sampling_rate=16000, but then it looked like I got a different prediction result (although it could just be that you get a different prediction result on each run).
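To avoid losing everything when Colab disconnects, one option is to append each prediction line to a file on disk (or on mounted Drive) as it is produced. A small sketch; the function name and file path are made up:

```python
def log_prediction(path, index, predicted, reference):
    """Append one tab-separated prediction line so progress survives a disconnect."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{index}\t{predicted}\t{reference.lower()}\n")

# Example call, mirroring the print in the loop above
log_prediction("predictions.tsv", 0, "jag heter anna", "Jag heter Anna")
```

Calling this inside the inference loop instead of (or in addition to) `print` means a disconnect at line 76 still leaves the first 76 results on disk.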

Did any of you start looking at using the NST database (I can see that it's listed in the sheet)? Maybe this would be good to collaborate on?

The KB labb trained a model using the NST database that currently has the lowest WER.

The model was trained only on NST, so a good next step might be to train on both NST and Common Voice.