Whisper Finetuning Dutch: Weird double characters

I am fine-tuning a Whisper model for Dutch. When I run the Dutch pipeline on some Common Voice data I get an initial WER of 14%, which seems reasonable. When fine-tuning, however, the initial evaluation WER is 19%, which contradicts the 14%. Checking where the extra errors come from, I see that my predictions often double part of a byte pair. For example, when a sentence starts with “het”, my model predicts the “het” token followed by the “et” token. This is wrong: looking at the labels, it should predict the “h” token and then the “et” token. What have I configured wrongly here?

Here are some examples of the error (pred_ids, label_ids, pred_strs, label_strs):

[50271 50271 50360 50364 27832  1718   390  3881   308 10553 31647  1601
    13 50257 50257 50257 50257   250     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39  1718   390  3881   308 10553 31647  1601
    13 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hijij was een echte vakman
hij was een echte vakman
 
[50271 50271 50360 50364 12045   302   367 25868 25329  6592    85  4326
  1479   294  3881   277   664 15615    74    13 50257     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39   302   367 25868 25329  6592    85  4326
  1479   294  3881   277   664 15615    74    13 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hetet rieten dak ontvlamde in een oogwenk
het rieten dak ontvlamde in een oogwenk
 
[50271 50271 50360 50364  1346 47237   372   303 35963   287  4698  1189
   582  3589   328 10317    13 50257 50257 50257 50257     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364 11089 47237   372   303 35963   287  4698  1189
   582  3589   328 10317    13 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
de rijstvelden lagen er prachtig bij
de rijstvelden lagen er prachtig bij
 
[50271 50271 50360 50364 12045   302  3342   335 28836  9638  1269    13
 50257     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39   302  3342   335 28836  9638  1269    13
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hetet raam staat nog open
het raam staat nog open

I have this tokenizer, feature extractor and model:

from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

model_id = "openai/whisper-large-v3"  # same checkpoint as in the pipeline snippet below

tokenizer = WhisperTokenizer.from_pretrained(model_id, language="Dutch", task="transcribe")
processor = WhisperProcessor.from_pretrained(model_id, language="Dutch", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, device_map="auto")
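
To make the pattern concrete, here is a quick sanity check with this tokenizer (a minimal sketch; the ids are copied from the fourth example above, with padding stripped):

# ids copied from the fourth pred/label pair above
pred_ids = [50271, 50271, 50360, 50364, 12045, 302, 3342, 335, 28836, 9638, 1269, 13, 50257]
label_ids = [50258, 50271, 50360, 50364, 39, 302, 3342, 335, 28836, 9638, 1269, 13, 50257]

print(tokenizer.decode(pred_ids, skip_special_tokens=True))   # hetet raam staat nog open
print(tokenizer.decode(label_ids, skip_special_tokens=True))  # het raam staat nog open

# 12045 is the single "het" token, while the labels use 39 ("h") + 302 ("et").
# Note also position 0: the prediction repeats <|nl|> (50271) where the labels
# start with <|startoftranscript|> (50258).
print(tokenizer.convert_ids_to_tokens([12045, 39, 302]))
# expect the "het" / "h" / "et" pieces (possibly with a leading-space marker)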

I thought it might be some forced decoder ids, so I was tinkering with those:

forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")
# hard-coded experiment: <|nl|> forced at positions 0 AND 1, then <|transcribe|>, <|notimestamps|>
model.config.forced_decoder_ids = [[0, 50271], [1, 50271], [2, 50360], [3, 50364]]  # forced_decoder_ids
# model.config.suppress_tokens = [50271, 50271, 50360, 50364, 50257]
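
For reference, printing what get_decoder_prompt_ids actually returns (at least on my transformers version) shows that the forced positions start at 1, with position 0 left for decoder_start_token_id (50258, <|startoftranscript|>), which matches the label rows above:

print(tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe"))
# [(1, 50271), (2, 50360), (3, 50364)]
# i.e. <|nl|>, <|transcribe|>, <|notimestamps|> at positions 1-3;
# position 0 is filled by generate() from model.config.decoder_start_token_id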

Any help greatly appreciated.

Actually, after some additional research, I have noticed that this behaviour is also present in the default Hugging Face Whisper fine-tuning tutorial: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

In that tutorial, changing only the language from Hindi to Dutch, and nothing else, produces the exact same behaviour.

This behaviour does not happen when using Whisper with the Hugging Face pipeline for inference, i.e.:

model_id = "openai/whisper-large-v3"
language = "Dutch"

# Initialize tokenizer and feature extractor with language setting
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
tokenizer = WhisperTokenizer.from_pretrained(model_id, language=language, task="transcribe")

model = WhisperForConditionalGeneration.from_pretrained(model_id)
forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language=language, task="transcribe")

model = model.half()
model.config.forced_decoder_ids = forced_decoder_ids

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    chunk_length_s=30,
    stride_length_s=(4, 2),
    torch_dtype=torch_dtype,
    device=device,
)
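
For completeness, calling it looks like this (minimal sketch; the wav path is just a placeholder):

result = asr_pipe("sample_nl.wav")  # placeholder path to any Dutch audio file
print(result["text"])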

I do not understand the mechanism by which this is caused, but I have noticed that using

forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")

and ensuring you are on transformers 4.37.2 resolves this. Setting the forced decoder prompt ids currently does not work on the dev branch.
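
For anyone hitting the same issue, here is a minimal sketch of the combination that worked for me (transformers pinned to 4.37.2, prompt ids taken from the processor's tokenizer):

# pip install transformers==4.37.2
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id, language="Dutch", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# positions start at 1; position 0 stays <|startoftranscript|> (50258),
# matching the label rows above
forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")
model.config.forced_decoder_ids = forced_decoder_ids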