WhisperTokenizer bos_token appears incorrect

For all the pretrained Whisper models on HF, why does the tokenizer's bos_token_id decode to <|endoftext|>?

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
print(tokenizer.bos_token_id, tokenizer.decode(tokenizer.bos_token_id))

I expect it to be <|startoftranscript|>.

This results in the example code given in https://huggingface.co/blog/fine-tune-whisper not behaving as intended:

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
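Because bos_token_id resolves to <|endoftext|> for these checkpoints, that condition never fires on labels whose first token is <|startoftranscript|>. A minimal sketch of what the check presumably intends, looking the start token up by its literal string (start_id is a name introduced here just for illustration):

        # look up the token Whisper actually prepends, rather than bos_token_id,
        # which resolves to <|endoftext|> for these checkpoints
        start_id = self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
        # strip it if the previous tokenization step already prepended it
        if (labels[:, 0] == start_id).all().cpu().item():
            labels = labels[:, 1:]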

The root of the problem appears to be the special_tokens_map.json of each model, e.g. https://huggingface.co/openai/whisper-tiny.en/blob/main/special_tokens_map.json :

"bos_token": "<|endoftext|>",

Hey,

It seems like this GitHub issue addresses the same topic.

TL;DR: Whisper doesn't use bos_token_id; it has its own sequence of prompt tokens, and <|startoftranscript|> is stored in model.config.decoder_start_token_id.
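For example (a quick check, assuming the same tiny.en checkpoint as in the question):

from transformers import WhisperForConditionalGeneration, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")

# the decoder start token lives on the model config, not on the tokenizer
start_id = model.config.decoder_start_token_id
print(start_id, tokenizer.decode(start_id))  # ... <|startoftranscript|>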

Personally, I don’t really understand the reasoning behind this. When you run tokenizer.encode(…) with add_special_tokens=True, it correctly prepends the <|startoftranscript|> token to the beginning of the sequence. However, there doesn’t seem to be a way to access this <|startoftranscript|> token as an attribute of the tokenizer; the closest I got was tokenizer.special_tokens_map["additional_special_tokens"][0], which is rather hacky, since you probably don’t want to rely on the order of additional_special_tokens.
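A slightly less brittle workaround, if you’re OK with hard-coding the token string, is to look it up by value rather than by position:

# resolve the start token by its literal string instead of its list position
start_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
print(start_id, tokenizer.decode(start_id))  # <|startoftranscript|>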

Cheers!