Adding prompt / context to Whisper with Huggingface Transformers

The Whisper model supports passing a prompt, i.e. adding the previous text as context for the current transcription task. This helps when transcribing a long file chunk after chunk.
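To make the chunk-after-chunk idea concrete, here is a minimal sketch of the loop: each chunk is transcribed with the previous chunk's text as its prompt. `transcribe_chunk` is a hypothetical stand-in for an actual Whisper inference call, not a real API.

```python
# Hypothetical stand-in for a real ASR call (e.g. Whisper on 30 s of audio).
def transcribe_chunk(audio_chunk, prompt=""):
    # A real implementation would run the model here, passing `prompt`
    # as the previous-context text.
    return f"[text for {audio_chunk}]"

def transcribe_long_audio(chunks):
    transcript_parts = []
    prompt = ""
    for chunk in chunks:
        text = transcribe_chunk(chunk, prompt=prompt)
        transcript_parts.append(text)
        prompt = text  # the next chunk sees this chunk's text as context
    return " ".join(transcript_parts)

print(transcribe_long_audio(["chunk0", "chunk1"]))
```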

During training, the paper says to “mask out the training loss over the previous context text, and train the model to predict all other tokens”.

I’m wondering whether HF has implemented that, and how much it helps with accuracy.

There is also this issue in Huggingface Transformers:

I’m wondering if it is fixed and stable.
From that issue, it seems the way to do it with HF is something like this:

prev_before = '<|startofprev|>'
current_before = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
current_after = "<|endoftext|>"

prev_text = 'Hello and'
current_text = 'welcome. My name is...'

prompt_and_text = prev_before + prev_text + current_before + current_text + current_after
# "<|startofprev|>Hello and<|startoftranscript|><|en|><|transcribe|><|notimestamps|>welcome. My name is...<|endoftext|>"

prompt_and_text_tokens = _tokenizer.encode(prompt_and_text, add_special_tokens=False)
# decoding these token ids back yields the same string, special tokens included
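One practical detail when building the prompt this way: the previous text must not eat up the whole decoder context. The numbers below are assumptions based on the reference OpenAI implementation, which reserves roughly half of the 448-token text context for the prompt; verify them against your model before relying on them.

```python
# Assumed context budget (based on the reference OpenAI implementation):
# the text context is 448 tokens and the prompt gets just under half of it.
MAX_TEXT_CTX = 448
MAX_PROMPT_TOKENS = MAX_TEXT_CTX // 2 - 1  # 223

def trim_prompt(prompt_token_ids):
    # Keep only the most recent tokens, so the prompt plus the new
    # transcription still fits in the decoder context.
    return prompt_token_ids[-MAX_PROMPT_TOKENS:]

token_ids = list(range(500))        # pretend these are token ids
print(len(trim_prompt(token_ids)))  # 223
```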


@SamuelAzran did you manage to get this working? I’d like to have the exact same functionality.

Any news on this?

I did not see Huggingface supporting it or providing an easy way to do it, like in the original openai/whisper library, but you can hack around it or use lower-level components in HF to give context or a prompt at inference time. At training time, training on prompts is much more complicated. I hope that someone from HF can address this issue.
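As a sketch of the "hack around with lower-level components" approach: you can build the decoder input ids by hand and pass them to `model.generate`. The token ids below are what I believe the multilingual tokenizer uses, but treat them as assumptions and look them up via the tokenizer (e.g. `tokenizer.convert_tokens_to_ids("<|startofprev|>")`) rather than hard-coding them.

```python
# Assumed special-token ids for the multilingual Whisper tokenizer;
# verify against your tokenizer before use.
SOT_PREV = 50361       # "<|startofprev|>"
SOT = 50258            # "<|startoftranscript|>"
LANG_EN = 50259        # "<|en|>"
TRANSCRIBE = 50359     # "<|transcribe|>"
NO_TIMESTAMPS = 50363  # "<|notimestamps|>"

def build_decoder_input_ids(prompt_token_ids):
    # The prompt tokens sit between <|startofprev|> and the usual forced
    # decoder tokens; the model then continues transcribing after them.
    return [SOT_PREV, *prompt_token_ids, SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS]

print(build_decoder_input_ids([11, 22]))
```

Worth checking too: newer versions of Transformers added explicit prompt support for Whisper (a `get_prompt_ids` helper on the processor and a `prompt_ids` argument to `generate`), which may make this hand-rolling unnecessary.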

The benefits of using prompts go beyond just giving the text of the previous speech segments. You can also provide prompts to control the style of the transcript, or add relevant domain-specific or industry-specific terms to the context so the transcription process will be more likely to use those terms in the final transcript.


Great! Thanks! That’s what I will end up doing

Related discussion: Is it possible to do model adaptation? · openai/whisper · Discussion #66 · GitHub

You can also use logit biasing to increase the probability of certain domain-specific words being predicted.
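A rough sketch of what logit biasing means: before the next token is chosen, add a fixed bonus to the logits of the token ids that make up your domain terms. In Transformers this would live in a custom `LogitsProcessor` passed to `model.generate`; here it is shown standalone on a plain list of logits.

```python
# Standalone sketch of logit biasing: boost selected token ids so they
# become more likely to be picked at decoding time.
def bias_logits(logits, boosted_ids, bias=5.0):
    return [
        logit + bias if token_id in boosted_ids else logit
        for token_id, logit in enumerate(logits)
    ]

logits = [1.0, 2.0, 1.5, 0.5]
boosted = bias_logits(logits, boosted_ids={3}, bias=5.0)
print(boosted.index(max(boosted)))  # token 3 now has the highest logit
```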