For all the Whisper pretrained models on HF, why does the tokenizer return bos_token_id as <|endoftext|>?
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
print(tokenizer.bos_token_id, tokenizer.decode(tokenizer.bos_token_id))
I expect it to be <|startoftranscript|>.
This results in the example code given in https://huggingface.co/blog/fine-tune-whisper not behaving as intended:
# if the bos token was appended in the previous tokenization step,
# cut it here, since it's appended again later anyway
if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
labels = labels[:, 1:]
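A minimal, self-contained sketch of why that check misfires. The token ids below are hypothetical stand-ins (not the real Whisper vocabulary), and plain lists stand in for the label tensor:

```python
# Hypothetical token ids, chosen only for illustration
SOT_ID = 50258          # pretend id of <|startoftranscript|>
EOT_ID = 50257          # pretend id of <|endoftext|>

# What the tokenizer actually reports as bos_token_id (per this bug):
bos_token_id = EOT_ID

# A batch of label sequences, each starting with <|startoftranscript|>
labels = [[SOT_ID, 15, 16, EOT_ID],
          [SOT_ID, 21, 22, EOT_ID]]

# The blog's check compares against bos_token_id, which here decodes to
# <|endoftext|>, so the condition is False and the leading
# <|startoftranscript|> token is never stripped.
blog_check = all(seq[0] == bos_token_id for seq in labels)
print(blog_check)     # False -> leading token is NOT cut

# Comparing against the actual start-of-transcript id behaves as intended.
correct_check = all(seq[0] == SOT_ID for seq in labels)
print(correct_check)  # True -> leading token is cut
if correct_check:
    labels = [seq[1:] for seq in labels]
```

In practice this means the condition silently evaluates to False, so every label sequence keeps its leading start token and it ends up duplicated when the model prepends its decoder start token.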
The root of the problem appears to be the special_tokens_map.json of each model, e.g. https://huggingface.co/openai/whisper-tiny.en/blob/main/special_tokens_map.json :
"bos_token": "<|endoftext|>",