How to set language in Whisper pipeline for audio transcription?

I want to run speech transcription with the openai/whisper-medium model using a pipeline, but I need to force a specific language in the output.

I tried generate_kwargs=dict(forced_decoder_ids=forced_decoder_ids), where forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe"), but the output is just an empty response: {'text': '', 'chunks': []}.

Is there a way to set the language?
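
For reference, here is roughly what I'm running (simplified; the exact pipeline arguments are reconstructed from the snippets above, and the audio path "audio.wav" is a placeholder):

from transformers import WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-medium", return_timestamps=True)

# this is the call that returns {'text': '', 'chunks': []}
result = pipe("audio.wav", generate_kwargs=dict(forced_decoder_ids=forced_decoder_ids))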

You can’t AFAIK, but you can get similar, if not identical, results this way:

import librosa

from transformers import WhisperProcessor, WhisperForConditionalGeneration

MAX_INPUT_LENGTH = 16000 * 30  # 30 seconds of audio at a 16 kHz sampling rate

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

# load the audio at 16 kHz (the sampling rate Whisper expects)
sample, sr = librosa.load("audio.WAV", sr=16000)
# split the audio into 30-second chunks, since Whisper processes at most 30 s at a time
sample_batch = [sample[i:i + MAX_INPUT_LENGTH] for i in range(0, len(sample), MAX_INPUT_LENGTH)]
input_features = processor(sample_batch, sampling_rate=sr, return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
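
Note that transcription is a list with one string per 30-second chunk. If you need a single string, you can simply join the chunks (naive concatenation, so a word cut at a chunk boundary may come out slightly wrong):

full_text = " ".join(t.strip() for t in transcription)
print(full_text)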

Hi F3RNI,

I have successfully managed to use Whisper with a pipeline on a specific language/task, thereby taking advantage of the smart chunking algorithm presented in this blog post.

My code is very similar to yours, except that I don’t use WhisperProcessor. Instead, I declare the WhisperTokenizer and WhisperFeatureExtractor separately:

from transformers import (
    WhisperForConditionalGeneration,
    WhisperFeatureExtractor,
    WhisperTokenizer,
    pipeline,
)

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium", language="french", task="transcribe")

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="french", task="transcribe")

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    chunk_length_s=30,
    stride_length_s=(4, 2)  # (left, right) overlap in seconds between consecutive chunks
)

Then you can use generate_kwargs as follows:

asr_pipe(
    audio_input,
    generate_kwargs={"forced_decoder_ids": forced_decoder_ids}
)["text"]

Hope this helps!
