I’m using a `GPT2TokenizerFast` tokenizer. When tokenizing, the tokenizer will not add special tokens, even when `add_special_tokens=True`. This is baffling to me, but appears to be intended behavior. How can I convert this tokenizer to one that does the exact same thing but actually adds special tokens?
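For reference, here’s a minimal reproduction of what I mean (the stock gpt2 checkpoint is just an example; the same thing happens with my own tokenizer):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# add_special_tokens=True is silently ignored: nothing is prepended or appended
ids = tok("hello world", add_special_tokens=True)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['hello', 'Ġworld']  <- no <|endoftext|> at either end
```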
Maybe you need to explain more specifically, e.g. how does your code implement adding the special tokens?
These are just standard special tokens, in this case the `bos` and `eos` tokens. They are held in the `special_tokens_map` tokenizer attribute. When saved, they end up in the `special_tokens_map.json` and `vocab.json` files. I’m not doing anything unusual or custom.
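Concretely, this is all there is to it (again using the stock gpt2 checkpoint as an example; gpt2_tok is just an arbitrary output directory):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.special_tokens_map)
# {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}

# Saving writes the special tokens out alongside the vocab
tok.save_pretrained("gpt2_tok")  # special_tokens_map.json, vocab.json, merges.txt, ...
```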
The issue is that with other tokenizers, tokenizing with `add_special_tokens=True` causes the `bos` and `eos` tokens to be added automatically, but with `GPT2TokenizerFast` this is ignored. `GPT2TokenizerFast` instead requires manually adding the `bos` and `eos` tokens to the text strings before tokenizing and numericalizing. This has caused some very sneaky bugs to crop up, because this `add_special_tokens` behavior is not documented anywhere.
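To make the current workaround explicit, this is roughly what the code has to look like today (the text is just a placeholder string):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
text = "hello world"

# Splice the special tokens into the string by hand before tokenizing
ids = tok(tok.bos_token + text + tok.eos_token, add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['<|endoftext|>', 'hello', 'Ġworld', '<|endoftext|>']
```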
One thing I’ve done is create a `RobertaTokenizerFast` from the `vocab.json` and `merges.txt` of a saved `GPT2TokenizerFast`. The Roberta-type tokenizer correctly adds the `bos` and `eos` tokens when `add_special_tokens=True`, but this feels hacky and confusing when we use this tokenizer with GPT2 models.
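For reference, this is roughly how I built it (the paths point at a save_pretrained() dump of the GPT-2 tokenizer and are just illustrative; all the special tokens are pinned to <|endoftext|> so nothing outside the GPT-2 vocab gets introduced):

```python
from transformers import RobertaTokenizerFast

roberta_tok = RobertaTokenizerFast(
    vocab_file="gpt2_tok/vocab.json",
    merges_file="gpt2_tok/merges.txt",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    cls_token="<|endoftext|>",
    sep_token="<|endoftext|>",
    unk_token="<|endoftext|>",
    pad_token="<|endoftext|>",
    mask_token="<|endoftext|>",
)

ids = roberta_tok("hello world", add_special_tokens=True)["input_ids"]
print(roberta_tok.convert_ids_to_tokens(ids))
# ['<|endoftext|>', 'hello', 'Ġworld', '<|endoftext|>']
```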
I’m looking for a way to get `GPT2TokenizerFast` to add the `bos` and `eos` tokens when `add_special_tokens=True`, which is the standard behavior of other tokenizers in the library.
I’m running into the same issue here. Is there a way to automatically add the `bos_token` and `eos_token`?