Adding prompt / context to Whisper with Huggingface Transformers

The Whisper model supports passing a prompt, i.e. adding the previous text as context for the current transcription task. This helps when transcribing a long file chunk after chunk.
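To make the chunk-after-chunk idea concrete, here is a minimal sketch of the loop: each chunk is transcribed with the previous chunk's text as its prompt. `transcribe_chunk` is a hypothetical stand-in for an actual Whisper inference call, not a real API.

```python
# Hypothetical stand-in for a real ASR call (e.g. Whisper on 30 s of audio).
def transcribe_chunk(audio_chunk, prompt=""):
    # A real implementation would run the model here, passing `prompt`
    # as the previous-context text.
    return f"[text for {audio_chunk}]"

def transcribe_long_audio(chunks):
    transcript_parts = []
    prompt = ""
    for chunk in chunks:
        text = transcribe_chunk(chunk, prompt=prompt)
        transcript_parts.append(text)
        prompt = text  # the next chunk sees this chunk's text as context
    return " ".join(transcript_parts)

print(transcribe_long_audio(["chunk0", "chunk1"]))
```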

During training, the paper says to “mask out the training loss over the previous context text, and train the model to predict all other tokens”.

I’m wondering whether HF has implemented that, and how much it helps with accuracy.

There is also this issue in Huggingface Transformers:

I’m wondering if it is fixed and stable.
From that issue, it seems the way to do it with HF is something like this:

prev_before = '<|startofprev|>'
current_before = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
current_after = "<|endoftext|>"

prev_text = 'Hello and'
current_text = 'welcome. My name is...'

prompt_and_text = prev_before + prev_text + current_before + current_text + current_after
# "<|startofprev|>Hello and<|startoftranscript|><|en|><|transcribe|><|notimestamps|>welcome. My name is...<|endoftext|>"

prompt_and_text_tokens = _tokenizer.encode(prompt_and_text, add_special_tokens=False)
# decoding these token ids back yields the same string, special tokens included
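One practical detail when building the prompt this way: the previous text must not eat up the whole decoder context. The numbers below are assumptions based on the reference OpenAI implementation, which reserves roughly half of the 448-token text context for the prompt; verify them against your model before relying on them.

```python
# Assumed context budget (based on the reference OpenAI implementation):
# the text context is 448 tokens and the prompt gets just under half of it.
MAX_TEXT_CTX = 448
MAX_PROMPT_TOKENS = MAX_TEXT_CTX // 2 - 1  # 223

def trim_prompt(prompt_token_ids):
    # Keep only the most recent tokens, so the prompt plus the new
    # transcription still fits in the decoder context.
    return prompt_token_ids[-MAX_PROMPT_TOKENS:]

token_ids = list(range(500))        # pretend these are token ids
print(len(trim_prompt(token_ids)))  # 223
```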


@SamuelAzran did you manage to get this working? I’d like to have the exact same functionality.

Any news on this?

I did not see Huggingface supporting it or providing an easy way to do it, like in the original openai/whisper library, but you can hack around it or use lower-level components in HF to give context or a prompt at inference time. At training time, training on prompts is much more complicated. I hope that someone from HF can address this issue.
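As a sketch of the "hack around with lower-level components" approach: you can build the decoder input ids by hand and pass them to `model.generate`. The token ids below are what I believe the multilingual tokenizer uses, but treat them as assumptions and look them up via the tokenizer (e.g. `tokenizer.convert_tokens_to_ids("<|startofprev|>")`) rather than hard-coding them.

```python
# Assumed special-token ids for the multilingual Whisper tokenizer;
# verify against your tokenizer before use.
SOT_PREV = 50361       # "<|startofprev|>"
SOT = 50258            # "<|startoftranscript|>"
LANG_EN = 50259        # "<|en|>"
TRANSCRIBE = 50359     # "<|transcribe|>"
NO_TIMESTAMPS = 50363  # "<|notimestamps|>"

def build_decoder_input_ids(prompt_token_ids):
    # The prompt tokens sit between <|startofprev|> and the usual forced
    # decoder tokens; the model then continues transcribing after them.
    return [SOT_PREV, *prompt_token_ids, SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS]

print(build_decoder_input_ids([11, 22]))
```

Worth checking too: newer versions of Transformers added explicit prompt support for Whisper (a `get_prompt_ids` helper on the processor and a `prompt_ids` argument to `generate`), which may make this hand-rolling unnecessary.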

The benefits of using prompts go beyond just giving the text of the previous speech segments. You can also provide prompts to control the style of the transcript, or add relevant domain-specific or industry-specific terms to the context so the transcription process will be more likely to use those terms in the final transcript.


Great! Thanks! That’s what I will end up doing

Related discussion: Is it possible to do model adaptation? · openai/whisper · Discussion #66 · GitHub

You can also use logit biasing to increase the probability of certain domain-specific words being predicted.
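A rough sketch of what logit biasing means: before the next token is chosen, add a fixed bonus to the logits of the token ids that make up your domain terms. In Transformers this would live in a custom `LogitsProcessor` passed to `model.generate`; here it is shown standalone on a plain list of logits.

```python
# Standalone sketch of logit biasing: boost selected token ids so they
# become more likely to be picked at decoding time.
def bias_logits(logits, boosted_ids, bias=5.0):
    return [
        logit + bias if token_id in boosted_ids else logit
        for token_id, logit in enumerate(logits)
    ]

logits = [1.0, 2.0, 1.5, 0.5]
boosted = bias_logits(logits, boosted_ids={3}, bias=5.0)
print(boosted.index(max(boosted)))  # token 3 now has the highest logit
```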