Finetuning Whisper with prompts

Hi @AvivSham,

I started digging into the actual code and I just realized that the Whisper tokenizer can accept two sentences as input, just as models such as BERT do. For BERT-like models the two input sentences are concatenated and separated by a [SEP] token:

[CLS] sentence1 [SEP] sentence2 [SEP]

This behaviour is kept in the Whisper tokenizer for API consistency reasons, although it is not actually used during finetuning. In the current code, the two sentences are simply concatenated if both are passed.
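For reference, the relevant method (build_inputs_with_special_tokens) currently looks roughly like this; I am paraphrasing from memory, so the exact code in tokenization_whisper.py may differ slightly:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # prefix_tokens are <|startoftranscript|>, the language token, the task token
    # and, by default, <|notimestamps|>.
    if token_ids_1 is None:
        return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
    # Pair logic kept only for API consistency: the second sequence is just appended.
    return self.prefix_tokens + token_ids_0 + token_ids_1 + [self.eos_token_id]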

In order to avoid too many changes in the code, I would simply replace that last return statement with the following, so that the output matches the format described in the original paper:

# <|startofprev|> goes first, then the prompt (token_ids_1), then the usual
# prefix tokens, the transcription (token_ids_0) and <|endoftext|>.
start_of_prev_id = self.all_special_ids[-3]
return [start_of_prev_id] + token_ids_1 + self.prefix_tokens + token_ids_0 + [self.eos_token_id]
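If you would rather not patch the installed library, the same change could probably be made by subclassing the tokenizer and overriding that method. This is only an untested sketch: PromptWhisperTokenizer is just a name I made up, and convert_tokens_to_ids("<|startofprev|>") is an alternative way of looking up the same special token id.

from transformers import WhisperTokenizer

class PromptWhisperTokenizer(WhisperTokenizer):
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        if token_ids_1 is None:
            # No prompt passed: keep the original behaviour.
            return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
        # Treat the second sequence as the prompt and place it first,
        # preceded by the <|startofprev|> special token.
        start_of_prev_id = self.convert_tokens_to_ids("<|startofprev|>")
        return [start_of_prev_id] + token_ids_1 + self.prefix_tokens + token_ids_0 + [self.eos_token_id]

tokenizer = PromptWhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)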

After that change, passing both the actual transcription and the prompt to the tokenizer should return the expected format:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
processor.tokenizer(" and the actual transcription is this.", "This is the prompt")

# Decoding the input_ids returned above should give:
# <|startofprev|>This is the prompt<|startoftranscript|><|en|><|transcribe|><|notimestamps|> and the actual transcription is this.<|endoftext|>
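To actually use this during finetuning, the prompt would then be passed as the second argument when building the labels. The following is only a rough, untested sketch that reuses the processor from above; prepare_dataset and the "sentence" and "prompt" column names are assumptions about how your dataset is laid out:

def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-mel input features for the encoder.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Labels: <|startofprev|> prompt <|startoftranscript|>... transcription <|endoftext|>
    batch["labels"] = processor.tokenizer(batch["sentence"], batch["prompt"]).input_ids
    return batch

If I remember the paper correctly, the training loss over the previous-context (prompt) tokens is masked out, so you would probably also want to set those label positions to -100 in the data collator; the sketch above does not do that.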

I hope this serves as a starting point for finetuning Whisper with prompts.
