I’d like to use the Whisper model in an ASR pipeline for languages other than English, but I’m not sure how to tell the pipeline which language the audio file is in. By default, it seems to actually understand the meaning of the audio (which is in German) but then always translates it into English:
from transformers import pipeline
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
pipe("testfile.mp3")
# ground truth: "So sind verschiedene Ăśberlandstrecken geplant."
# model prediction: "So are various crossings planned."
I tried adding language="de" both when creating the pipeline and when calling it, but to no avail.
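In newer transformers releases, passing the language through generate_kwargs may be the intended route; a minimal sketch of that variant, assuming a transformers version whose Whisper generate accepts language and task arguments:

from transformers import pipeline

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
# Request a German transcription rather than a translation into English
result = pipe("testfile.mp3", generate_kwargs={"language": "german", "task": "transcribe"})
print(result["text"])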
By default, the Whisper pipeline forces the IDs of some special tokens at the beginning of the generated sequence (pipe.model.config.forced_decoder_ids), namely [[1, 50259], [2, 50359], [3, 50363]].
The meaning of these token IDs can be found in the added_tokens.json file:
"<|en|>": 50259,
"<|transcribe|>": 50359,
"<|notimestamps|>": 50363
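You can also resolve these IDs programmatically via the tokenizer instead of digging through the JSON file; a minimal sketch:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
# Look up the IDs of the special tokens
print(tokenizer.convert_tokens_to_ids("<|en|>"))          # 50259
print(tokenizer.convert_tokens_to_ids("<|de|>"))          # 50261
print(tokenizer.convert_tokens_to_ids("<|transcribe|>"))  # 50359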
You can set the first forced token ID to that of "<|de|>" (ID 50261) explicitly, like this:
pipe.model.config.forced_decoder_ids[0][1] = 50261
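A cleaner way to build the whole list, rather than patching it by hand, is the processor's get_decoder_prompt_ids helper, assuming a transformers version that ships it:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# Produces forced IDs of the form [(1, <|de|>), (2, <|transcribe|>), (3, <|notimestamps|>)]
pipe.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="german", task="transcribe"
)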
Alternatively, you can simply set the forced_decoder_ids to None, which leaves the model to detect the language itself, but in my experience this does not work as reliably for German input:
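pipe.model.config.forced_decoder_ids = None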
Quick follow-up: where can we find the language codes / languages that Whisper supports? (Or how can we correctly map a two-letter language code to the corresponding Whisper language?)
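In case it helps, the Whisper tokenizer module in transformers ships dictionaries mapping between two-letter codes and the language names Whisper uses; a minimal sketch, assuming these module-level constants exist in your transformers version:

from transformers.models.whisper.tokenization_whisper import LANGUAGES, TO_LANGUAGE_CODE

# Two-letter code -> language name understood by Whisper
print(LANGUAGES["de"])             # "german"
# Language name -> two-letter code
print(TO_LANGUAGE_CODE["german"])  # "de"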