How to set audio language in Whisper Pipeline?

I’d like to use the Whisper model in an ASR pipeline for languages other than English, but I’m not sure how to tell the pipeline which language the audio file is in. By default, it seems to understand the meaning of the audio (which is in German) but then always translates it into English:

from transformers import pipeline
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")

# ground truth: "So sind verschiedene Überlandstrecken geplant."
# model prediction: "So are various crossings planned."

I tried adding language="de" when creating the pipeline or when calling the pipeline, but to no avail.


@sanchit-gandhi I’ve got the same problem with a model fine-tuned following your amazing blog post.

By default, the Whisper pipeline prepends the IDs of some special tokens to the decoder input (pipe.model.config.forced_decoder_ids), namely [[1, 50259], [2, 50359], [3, 50363]].
The meaning of these token IDs can be found in the added_tokens.json file:

  • "<|en|>": 50259,
  • "<|transcribe|>": 50359,
  • "<|notimestamps|>": 50363.

You can either set the first token ID to "<|de|>" (ID 50261) explicitly like this:

pipe.model.config.forced_decoder_ids = [[1, 50261], [2, 50359], [3, 50363]]
or simply set the forced_decoder_ids to None, though in my experience this does not work as reliably for German input:

pipe.model.config.forced_decoder_ids = None

Hey there, you can pass

forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)

then in the pipeline:

pipe(generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, … other args …)
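Put together, a runnable sketch of this approach (the checkpoint name and audio path below are placeholders; assumes a recent transformers release):

```python
import os

from transformers import WhisperProcessor, pipeline

# Placeholder checkpoint; substitute your fine-tuned model if you have one.
model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)

# Force German transcription rather than translation into English.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
print(forced_decoder_ids)

# Hypothetical audio path; only run inference if the file actually exists.
audio_path = "audio.mp3"
if os.path.exists(audio_path):
    pipe = pipeline(task="automatic-speech-recognition", model=model_name)
    result = pipe(audio_path, generate_kwargs={"forced_decoder_ids": forced_decoder_ids})
    print(result["text"])
```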


Hi, have you found a solution?

All you need is:

result = pipe("*.mp3", generate_kwargs={"language": "english"})
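A hedged sketch of that approach (assuming a recent transformers release where the pipeline forwards language/task to generate; the audio path is a placeholder, and for the original question you would pass "german" rather than "english"):

```python
import os

from transformers import pipeline
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

# Whisper accepts full language names as well as ISO codes; full names
# are normalized through TO_LANGUAGE_CODE (e.g. "german" -> "de").
print(TO_LANGUAGE_CODE["german"])

# Placeholder path; only run inference if the file actually exists.
audio_path = "audio.mp3"
if os.path.exists(audio_path):
    pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
    result = pipe(audio_path, generate_kwargs={"language": "german", "task": "transcribe"})
    print(result["text"])
```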