Adding prompt / context to Whisper with Huggingface Transformers

The Whisper model supports passing a prompt, i.e. adding the previous text as context for the current transcription task. This helps when transcribing a long file chunk after chunk.
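To make the chunk-after-chunk idea concrete, here is a minimal sketch of the loop: each chunk is transcribed with the previous chunk's text as its prompt. `transcribe_chunk` is a hypothetical stand-in for an actual Whisper inference call, not a real API.

```python
# Hypothetical stand-in for a real ASR call (e.g. Whisper on 30 s of audio).
def transcribe_chunk(audio_chunk, prompt=""):
    # A real implementation would run the model here, passing `prompt`
    # as the previous-context text.
    return f"[text for {audio_chunk}]"

def transcribe_long_audio(chunks):
    transcript_parts = []
    prompt = ""
    for chunk in chunks:
        text = transcribe_chunk(chunk, prompt=prompt)
        transcript_parts.append(text)
        prompt = text  # the next chunk sees this chunk's text as context
    return " ".join(transcript_parts)

print(transcribe_long_audio(["chunk0", "chunk1"]))
```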

During training, the paper says to “mask out the training loss over the previous context text, and train the model to predict all other tokens”.

I’m wondering whether HF has implemented that, and how much it helps with accuracy.

There is also this issue in Huggingface Transformers:

I’m wondering if it is fixed and stable.
From that issue, it seems the way to do it with HF is something like this:

prev_before = '<|startofprev|>'
current_before = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
current_after = "<|endoftext|>"

prev_text = 'Hello and'
current_text = 'welcome. My name is...'

prompt_and_text = prev_before + prev_text + current_before + current_text + current_after
# "<|startofprev|>Hello and<|startoftranscript|><|en|><|transcribe|><|notimestamps|>welcome. My name is...<|endoftext|>"

prompt_and_text_tokens = _tokenizer.encode(prompt_and_text, add_special_tokens=False)
# decoding these token ids back yields the same string, special tokens included
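One practical detail when building the prompt this way: the previous text must not eat up the whole decoder context. The numbers below are assumptions based on the reference OpenAI implementation, which reserves roughly half of the 448-token text context for the prompt; verify them against your model before relying on them.

```python
# Assumed context budget (based on the reference OpenAI implementation):
# the text context is 448 tokens and the prompt gets just under half of it.
MAX_TEXT_CTX = 448
MAX_PROMPT_TOKENS = MAX_TEXT_CTX // 2 - 1  # 223

def trim_prompt(prompt_token_ids):
    # Keep only the most recent tokens, so the prompt plus the new
    # transcription still fits in the decoder context.
    return prompt_token_ids[-MAX_PROMPT_TOKENS:]

token_ids = list(range(500))        # pretend these are token ids
print(len(trim_prompt(token_ids)))  # 223
```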


@SamuelAzran did you manage to get this working? I’d like to have the exact same functionality.

Any news on this?

I did not see Huggingface supporting it or providing an easy way to do it, like in the original openai/whisper library, but you can hack around it or use lower-level components in HF to give context or a prompt at inference time. At training time, training on prompts is much more complicated. I hope that someone from HF can address this issue.
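As a sketch of the "hack around with lower-level components" approach: you can build the decoder input ids by hand and pass them to `model.generate`. The token ids below are what I believe the multilingual tokenizer uses, but treat them as assumptions and look them up via the tokenizer (e.g. `tokenizer.convert_tokens_to_ids("<|startofprev|>")`) rather than hard-coding them.

```python
# Assumed special-token ids for the multilingual Whisper tokenizer;
# verify against your tokenizer before use.
SOT_PREV = 50361       # "<|startofprev|>"
SOT = 50258            # "<|startoftranscript|>"
LANG_EN = 50259        # "<|en|>"
TRANSCRIBE = 50359     # "<|transcribe|>"
NO_TIMESTAMPS = 50363  # "<|notimestamps|>"

def build_decoder_input_ids(prompt_token_ids):
    # The prompt tokens sit between <|startofprev|> and the usual forced
    # decoder tokens; the model then continues transcribing after them.
    return [SOT_PREV, *prompt_token_ids, SOT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS]

print(build_decoder_input_ids([11, 22]))
```

Worth checking too: newer versions of Transformers added explicit prompt support for Whisper (a `get_prompt_ids` helper on the processor and a `prompt_ids` argument to `generate`), which may make this hand-rolling unnecessary.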

The benefits of using prompts go beyond just giving the text of the previous speech segments. You can also provide prompts to control the style of the transcript, or add relevant domain-specific or industry-specific terms to the context so the transcription process will be more likely to use those terms in the final transcript.


Great! Thanks! That’s what I will end up doing

Related discussion: Is it possible to do model adaptation? · openai/whisper · Discussion #66 · GitHub

You can also use logit biasing to increase the probability of certain domain-specific words being predicted.
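A rough sketch of what logit biasing means: before the next token is chosen, add a fixed bonus to the logits of the token ids that make up your domain terms. In Transformers this would live in a custom `LogitsProcessor` passed to `model.generate`; here it is shown standalone on a plain list of logits.

```python
# Standalone sketch of logit biasing: boost selected token ids so they
# become more likely to be picked at decoding time.
def bias_logits(logits, boosted_ids, bias=5.0):
    return [
        logit + bias if token_id in boosted_ids else logit
        for token_id, logit in enumerate(logits)
    ]

logits = [1.0, 2.0, 1.5, 0.5]
boosted = bias_logits(logits, boosted_ids={3}, bias=5.0)
print(boosted.index(max(boosted)))  # token 3 now has the highest logit
```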