WhisperTokenizer bos_token appears incorrect

For all the pretrained Whisper models on HF, why does the tokenizer's bos_token_id decode to <|endoftext|>?

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
print(tokenizer.bos_token_id, tokenizer.decode(tokenizer.bos_token_id))

I expect it to be <|startoftranscript|>.

This results in the example code given in https://huggingface.co/blog/fine-tune-whisper not behaving as intended:

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
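Because bos_token_id resolves to <|endoftext|> for these checkpoints, that condition never fires on labels whose first token is <|startoftranscript|>. A minimal sketch of what the check presumably intends, looking the start token up by its literal string (start_id is a name introduced here just for illustration):

        # look up the token Whisper actually prepends, rather than bos_token_id,
        # which resolves to <|endoftext|> for these checkpoints
        start_id = self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
        # strip it if the previous tokenization step already prepended it
        if (labels[:, 0] == start_id).all().cpu().item():
            labels = labels[:, 1:]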

The root of the problem appears to be the special_tokens_map.json of each model, e.g. https://huggingface.co/openai/whisper-tiny.en/blob/main/special_tokens_map.json :

"bos_token": "<|endoftext|>",

Hey,

It seems like this GitHub issue addresses the same topic.

TL;DR: Whisper doesn't use bos_token_id; it has its own sequence of prompt tokens, and <|startoftranscript|> is stored in model.config.decoder_start_token_id.
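For example (a quick check, assuming the same tiny.en checkpoint as in the question):

from transformers import WhisperForConditionalGeneration, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")

# the decoder start token lives on the model config, not on the tokenizer
start_id = model.config.decoder_start_token_id
print(start_id, tokenizer.decode(start_id))  # ... <|startoftranscript|>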

Personally, I don’t really understand the reasoning behind this. When you run tokenizer.encode(…) with add_special_tokens=True, it correctly prepends the <|startoftranscript|> token to the beginning of the sequence. However, there doesn’t seem to be a way to access this <|startoftranscript|> token as an attribute of the tokenizer; the closest I got was tokenizer.special_tokens_map["additional_special_tokens"][0], which is rather hacky, since you probably don’t want to rely on the order of additional_special_tokens.
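A slightly less brittle workaround, if you’re OK with hard-coding the token string, is to look it up by value rather than by position:

# resolve the start token by its literal string instead of its list position
start_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
print(start_id, tokenizer.decode(start_id))  # <|startoftranscript|>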

Cheers!