I am fine-tuning a Whisper model for Dutch. When I run the stock Dutch pipeline on some Common Voice data I get a WER of 14%, which seems reasonable. But when I evaluate at the start of fine-tuning I get an initial WER of 19%, which contradicts the 14%. When I check where the extra errors come from, I see that my predictions often double part of the first byte pair. For example, when a sentence starts with "het", my model predicts the "het" token followed by the "et" token. This is wrong: looking at the labels, it should predict the "h" token and then the "et" token. What have I configured wrongly here?
Here are some examples of the error, shown as (pred_ids, label_ids, pred_str, label_str):
[50271 50271 50360 50364 27832 1718 390 3881 308 10553 31647 1601
13 50257 50257 50257 50257 250 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0]
[50258 50271 50360 50364 39 1718 390 3881 308 10553 31647 1601
13 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257]
hijij was een echte vakman
hij was een echte vakman
[50271 50271 50360 50364 12045 302 367 25868 25329 6592 85 4326
1479 294 3881 277 664 15615 74 13 50257 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0]
[50258 50271 50360 50364 39 302 367 25868 25329 6592 85 4326
1479 294 3881 277 664 15615 74 13 50257 50257 50257 50257
50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257]
hetet rieten dak ontvlamde in een oogwenk
het rieten dak ontvlamde in een oogwenk
[50271 50271 50360 50364 1346 47237 372 303 35963 287 4698 1189
582 3589 328 10317 13 50257 50257 50257 50257 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0]
[50258 50271 50360 50364 11089 47237 372 303 35963 287 4698 1189
582 3589 328 10317 13 50257 50257 50257 50257 50257 50257 50257
50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257]
de rijstvelden lagen er prachtig bij
de rijstvelden lagen er prachtig bij
[50271 50271 50360 50364 12045 302 3342 335 28836 9638 1269 13
50257 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0]
[50258 50271 50360 50364 39 302 3342 335 28836 9638 1269 13
50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
50257]
hetet raam staat nog open
het raam staat nog open
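If I read a large-v3-style vocab correctly (an assumption, since I haven't confirmed every ID), 50258 is `<|startoftranscript|>`, 50271 is `<|nl|>`, 50360 is `<|transcribe|>` and 50364 is `<|notimestamps|>` — so the labels carry the normal prompt while my predictions start with `<|nl|>` twice. In fact only two positions differ in each pair. A minimal stdlib check over the first example (IDs copied from above, padding and `<|endoftext|>` stripped):

```python
# IDs copied from the first (pred, label) pair above, truncated before <|endoftext|> (50257)
pred_ids  = [50271, 50271, 50360, 50364, 27832, 1718, 390, 3881, 308, 10553, 31647, 1601, 13]
label_ids = [50258, 50271, 50360, 50364,    39, 1718, 390, 3881, 308, 10553, 31647, 1601, 13]

# positions where prediction and label disagree
diffs = [i for i, (p, l) in enumerate(zip(pred_ids, label_ids)) if p != l]
print(diffs)  # -> [0, 4]: the very first prompt token and the very first text token
```

Everything after the first text token matches exactly, which is why I suspect the prompt configuration rather than the acoustics.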
I set up the tokenizer, processor, feature extractor, and model like this:
tokenizer = WhisperTokenizer.from_pretrained(model_id, language="Dutch", task="transcribe")
processor = WhisperProcessor.from_pretrained(model_id, language="Dutch", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, device_map="auto")
I thought it might be a forced decoder ids issue, so I was tinkering with that:
forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")
# hard-coded instead of using forced_decoder_ids from the line above:
model.config.forced_decoder_ids = [[0, 50271], [1, 50271], [2, 50360], [3, 50364]]
# model.config.suppress_tokens = [50271, 50271, 50360, 50364, 50257]
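My mental model of what these forced ids do — a toy sketch and an assumption on my part, not the actual transformers implementation — is that `generate()` puts `decoder_start_token_id` at position 0 and `forced_decoder_ids` then pin later positions on top. Under that assumption, forcing position 0 to 50271 would clobber the `<|startoftranscript|>` token that my labels start with:

```python
# Toy sketch of how I *think* the decoder prompt is assembled during generation:
# position 0 comes from decoder_start_token_id, and forced_decoder_ids pin
# positions on top of it. This models my assumption, not the real library code.
DECODER_START_TOKEN_ID = 50258  # <|startoftranscript|> in my checkpoint's config

def build_prompt(forced_decoder_ids):
    prompt = {0: DECODER_START_TOKEN_ID}
    for position, token_id in forced_decoder_ids:
        prompt[position] = token_id
    return [prompt[i] for i in sorted(prompt)]

# positions starting at 1 leave <|startoftranscript|> intact -> matches my labels
print(build_prompt([(1, 50271), (2, 50360), (3, 50364)]))
# -> [50258, 50271, 50360, 50364]

# what I configured: position 0 forced to 50271 -> matches my broken predictions
print(build_prompt([(0, 50271), (1, 50271), (2, 50360), (3, 50364)]))
# -> [50271, 50271, 50360, 50364]
```

If that picture is right, the fix would be to let `get_decoder_prompt_ids` supply the positions rather than hard-coding them from 0 — but I'd like confirmation that this is actually what's going wrong.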
Any help greatly appreciated.