Finetuning Whisper with prompts

Hi @AvivSham,

I started digging into the actual code and I just realized that the Whisper tokenizer can accept two sentences as input, just as models such as BERT do. For BERT-like models the two input sentences are concatenated and separated by a [SEP] token:

[CLS] sentence1 [SEP] sentence2 [SEP]

This behaviour is kept in the Whisper tokenizer for API consistency reasons, although it is not actually used during finetuning. In the current code, the two sentences are simply concatenated if both are passed.
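For reference, the relevant method (build_inputs_with_special_tokens) currently looks roughly like this; I am paraphrasing from memory, so the exact code in tokenization_whisper.py may differ slightly:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # prefix_tokens are <|startoftranscript|>, the language token, the task token
    # and, by default, <|notimestamps|>.
    if token_ids_1 is None:
        return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
    # Pair logic kept only for API consistency: the second sequence is just appended.
    return self.prefix_tokens + token_ids_0 + token_ids_1 + [self.eos_token_id]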

In order to avoid too many changes in the code, I would simply replace that last return statement with the following, so that the output matches the format described in the original paper:

# <|startofprev|> goes first, then the prompt (token_ids_1), then the usual
# prefix tokens, the transcription (token_ids_0) and <|endoftext|>.
start_of_prev_id = self.all_special_ids[-3]
return [start_of_prev_id] + token_ids_1 + self.prefix_tokens + token_ids_0 + [self.eos_token_id]
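If you would rather not patch the installed library, the same change could probably be made by subclassing the tokenizer and overriding that method. This is only an untested sketch: PromptWhisperTokenizer is just a name I made up, and convert_tokens_to_ids("<|startofprev|>") is an alternative way of looking up the same special token id.

from transformers import WhisperTokenizer

class PromptWhisperTokenizer(WhisperTokenizer):
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        if token_ids_1 is None:
            # No prompt passed: keep the original behaviour.
            return self.prefix_tokens + token_ids_0 + [self.eos_token_id]
        # Treat the second sequence as the prompt and place it first,
        # preceded by the <|startofprev|> special token.
        start_of_prev_id = self.convert_tokens_to_ids("<|startofprev|>")
        return [start_of_prev_id] + token_ids_1 + self.prefix_tokens + token_ids_0 + [self.eos_token_id]

tokenizer = PromptWhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)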

After that change, passing both the actual transcription and the prompt to the tokenizer should return the expected format:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
processor.tokenizer(" and the actual transcription is this.", "This is the prompt")

# Decoding the input_ids returned above should give:
# <|startofprev|>This is the prompt<|startoftranscript|><|en|><|transcribe|><|notimestamps|> and the actual transcription is this.<|endoftext|>
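To actually use this during finetuning, the prompt would then be passed as the second argument when building the labels. The following is only a rough, untested sketch that reuses the processor from above; prepare_dataset and the "sentence" and "prompt" column names are assumptions about how your dataset is laid out:

def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-mel input features for the encoder.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Labels: <|startofprev|> prompt <|startoftranscript|>... transcription <|endoftext|>
    batch["labels"] = processor.tokenizer(batch["sentence"], batch["prompt"]).input_ids
    return batch

If I remember the paper correctly, the training loss over the previous-context (prompt) tokens is masked out, so you would probably also want to set those label positions to -100 in the data collator; the sketch above does not do that.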

I hope this serves as a starting point for finetuning Whisper with prompts.
