Whisper Finetuning Dutch: Weird double characters

I am fine-tuning a Whisper model for Dutch. When I run the Dutch pipeline on some Common Voice data I get an initial WER of 14%, which seems reasonable. When fine-tuning, however, the initial evaluation WER is 19%, which contradicts the 14%. Checking where the extra errors come from, I see that my predictions often double part of a byte pair. For example, when a sentence starts with “het”, my model predicts the “het” token followed by the “et” token. This is wrong: looking at the labels, it should predict the “h” token and then the “et” token. What have I configured wrongly here?

Here are some examples of the error (pred_ids, label_ids, pred_strs, label_strs):

[50271 50271 50360 50364 27832  1718   390  3881   308 10553 31647  1601
    13 50257 50257 50257 50257   250     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39  1718   390  3881   308 10553 31647  1601
    13 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hijij was een echte vakman
hij was een echte vakman
 
[50271 50271 50360 50364 12045   302   367 25868 25329  6592    85  4326
  1479   294  3881   277   664 15615    74    13 50257     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39   302   367 25868 25329  6592    85  4326
  1479   294  3881   277   664 15615    74    13 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hetet rieten dak ontvlamde in een oogwenk
het rieten dak ontvlamde in een oogwenk
 
[50271 50271 50360 50364  1346 47237   372   303 35963   287  4698  1189
   582  3589   328 10317    13 50257 50257 50257 50257     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364 11089 47237   372   303 35963   287  4698  1189
   582  3589   328 10317    13 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
de rijstvelden lagen er prachtig bij
de rijstvelden lagen er prachtig bij
 
[50271 50271 50360 50364 12045   302  3342   335 28836  9638  1269    13
 50257     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]
[50258 50271 50360 50364    39   302  3342   335 28836  9638  1269    13
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257
 50257]
hetet raam staat nog open
het raam staat nog open

I have this tokenizer, feature extractor and model:

from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

model_id = "openai/whisper-large-v3"  # same checkpoint as in the pipeline snippet below

tokenizer = WhisperTokenizer.from_pretrained(model_id, language="Dutch", task="transcribe")
processor = WhisperProcessor.from_pretrained(model_id, language="Dutch", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, device_map="auto")
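
To make the pattern concrete, here is a quick sanity check with this tokenizer (a minimal sketch; the ids are copied from the fourth example above, with padding stripped):

# ids copied from the fourth pred/label pair above
pred_ids = [50271, 50271, 50360, 50364, 12045, 302, 3342, 335, 28836, 9638, 1269, 13, 50257]
label_ids = [50258, 50271, 50360, 50364, 39, 302, 3342, 335, 28836, 9638, 1269, 13, 50257]

print(tokenizer.decode(pred_ids, skip_special_tokens=True))   # hetet raam staat nog open
print(tokenizer.decode(label_ids, skip_special_tokens=True))  # het raam staat nog open

# 12045 is the single "het" token, while the labels use 39 ("h") + 302 ("et").
# Note also position 0: the prediction repeats <|nl|> (50271) where the labels
# start with <|startoftranscript|> (50258).
print(tokenizer.convert_ids_to_tokens([12045, 39, 302]))
# expect the "het" / "h" / "et" pieces (possibly with a leading-space marker)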

I thought it might be some forced decoder ids, so I was tinkering with those:

forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")
# hard-coded experiment: <|nl|> forced at positions 0 AND 1, then <|transcribe|>, <|notimestamps|>
model.config.forced_decoder_ids = [[0, 50271], [1, 50271], [2, 50360], [3, 50364]]  # forced_decoder_ids
# model.config.suppress_tokens = [50271, 50271, 50360, 50364, 50257]
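
For reference, printing what get_decoder_prompt_ids actually returns (at least on my transformers version) shows that the forced positions start at 1, with position 0 left for decoder_start_token_id (50258, <|startoftranscript|>), which matches the label rows above:

print(tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe"))
# [(1, 50271), (2, 50360), (3, 50364)]
# i.e. <|nl|>, <|transcribe|>, <|notimestamps|> at positions 1-3;
# position 0 is filled by generate() from model.config.decoder_start_token_id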

Any help greatly appreciated.

Actually, after some additional research, I have noticed that this behaviour is also present in the default Hugging Face Whisper fine-tuning tutorial: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

In that tutorial, changing only the language from Hindi to Dutch, and nothing else, produces the exact same behaviour.

This behaviour does not happen when using Whisper with the Hugging Face pipeline for inference, i.e.:

model_id = "openai/whisper-large-v3"
language = "Dutch"

# Initialize tokenizer and feature extractor with language setting
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
tokenizer = WhisperTokenizer.from_pretrained(model_id, language=language, task="transcribe")

model = WhisperForConditionalGeneration.from_pretrained(model_id)
forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language=language, task="transcribe")

model = model.half()
model.config.forced_decoder_ids = forced_decoder_ids

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    chunk_length_s=30,
    stride_length_s=(4, 2),
    torch_dtype=torch_dtype,
    device=device,
)
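
For completeness, calling it looks like this (minimal sketch; the wav path is just a placeholder):

result = asr_pipe("sample_nl.wav")  # placeholder path to any Dutch audio file
print(result["text"])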

I do not understand the mechanism by which this is caused, but I have noticed that using

forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")

and ensuring you are on transformers 4.37.2 resolves this. Setting the forced decoder prompt ids currently does not work on the dev branch.
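
For anyone hitting the same issue, here is a minimal sketch of the combination that worked for me (transformers pinned to 4.37.2, prompt ids taken from the processor's tokenizer):

# pip install transformers==4.37.2
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id, language="Dutch", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# positions start at 1; position 0 stays <|startoftranscript|> (50258),
# matching the label rows above
forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(language="Dutch", task="transcribe")
model.config.forced_decoder_ids = forced_decoder_ids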