How to make GPT2 Tokenizer actually add special tokens

I’m using a GPT2TokenizerFast tokenizer. When tokenizing, the tokenizer will not add special tokens, even when add_special_tokens=True. This is baffling to me, but appears to be intended behavior. How can I convert this tokenizer to one that does the exact same thing but actually adds special tokens?
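For reference, here is a minimal reproduction of what I'm seeing (using the stock `gpt2` checkpoint, where bos and eos both default to `<|endoftext|>`; my own tokenizer uses custom tokens but behaves the same way):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.bos_token, tokenizer.eos_token)  # both <|endoftext|> for stock gpt2

ids = tokenizer("hello world", add_special_tokens=True)["input_ids"]
print(tokenizer.bos_token_id in ids, tokenizer.eos_token_id in ids)
# False False: nothing was prepended or appended despite add_special_tokens=True
```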

Maybe you need to be more specific. For example, how does your code add the special tokens?

These are just standard special tokens, in this case the bos and eos tokens. They are held in the special_tokens_map tokenizer attribute. When saved, they end up in the special_tokens_map.json and vocab.json files. I’m not doing anything unusual or custom.

The issue is that with other tokenizers, tokenizing with add_special_tokens=True causes bos and eos tokens to be added automatically, but GPT2TokenizerFast ignores the flag. Its behavior requires manually adding the bos and eos tokens to the text strings before tokenizing and numericalizing. This has caused some very sneaky bugs to crop up, because this add_special_tokens behavior is not documented anywhere.
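By "manually adding" I mean something like this sketch (the `gpt2` checkpoint and the example string are just placeholders for my actual setup):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "hello world"
# The special tokens have to be spliced into the string by hand;
# the tokenizer recognizes them and maps each to its single token ID.
ids = tokenizer(tokenizer.bos_token + text + tokenizer.eos_token)["input_ids"]
print(ids[0] == tokenizer.bos_token_id, ids[-1] == tokenizer.eos_token_id)
# True True
```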

One thing I’ve done is create a RobertaTokenizerFast from the vocab.json and merges.txt of a saved GPT2TokenizerFast. The Roberta-type tokenizer correctly adds bos and eos tokens when add_special_tokens=True, but this feels hacky and confusing when we use the tokenizer with GPT2 models.
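Roughly, that workaround looks like the sketch below. The paths and checkpoint name are placeholders, and since Roberta's post-processor keys off the cls/sep tokens, I point those at the GPT2 bos/eos tokens as well:

```python
from transformers import GPT2TokenizerFast, RobertaTokenizerFast

# Save the GPT2 tokenizer so its vocab.json and merges.txt land on disk
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2_tok.save_pretrained("gpt2-tok")

# Rebuild a Roberta-style tokenizer on the same vocab/merges,
# reusing the GPT2 special tokens for bos/eos and cls/sep.
roberta_tok = RobertaTokenizerFast(
    vocab_file="gpt2-tok/vocab.json",
    merges_file="gpt2-tok/merges.txt",
    bos_token=gpt2_tok.bos_token,
    eos_token=gpt2_tok.eos_token,
    cls_token=gpt2_tok.bos_token,
    sep_token=gpt2_tok.eos_token,
    unk_token=gpt2_tok.unk_token,
)

# add_special_tokens=True now actually wraps the sequence with bos/eos IDs
print(roberta_tok("hello world", add_special_tokens=True)["input_ids"])
```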

I’m looking for a way to get GPT2TokenizerFast to add bos and eos tokens when add_special_tokens=True, which is the standard behavior of other tokenizers in the library.


Running into the same issue here. Is there a way to automatically add the bos_token and eos_token?