I’m using a `GPT2TokenizerFast` tokenizer. When tokenizing, the tokenizer will not add special tokens, even when `add_special_tokens=True`. This is baffling to me, but appears to be intended behavior. How can I convert this tokenizer to one that does the exact same thing but actually adds special tokens?
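For reference, here’s a minimal reproduction of what I mean (the stock gpt2 checkpoint is just an example; the same thing happens with my own tokenizer):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# add_special_tokens=True is silently ignored: nothing is prepended or appended
ids = tok("hello world", add_special_tokens=True)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['hello', 'Ġworld']  <- no <|endoftext|> at either end
```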
Maybe you need to explain more specifically, e.g. how does your code implement adding the special tokens?
These are just standard special tokens, in this case the `bos` and `eos` tokens. They are held in the `special_tokens_map` tokenizer attribute. When saved, they end up in the `special_tokens_map.json` and `vocab.json` files. I’m not doing anything unusual or custom.
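Concretely, this is all there is to it (again using the stock gpt2 checkpoint as an example; gpt2_tok is just an arbitrary output directory):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.special_tokens_map)
# {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}

# Saving writes the special tokens out alongside the vocab
tok.save_pretrained("gpt2_tok")  # special_tokens_map.json, vocab.json, merges.txt, ...
```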
The issue is that with other tokenizers, tokenizing with `add_special_tokens=True` causes the `bos` and `eos` tokens to be added automatically, but with `GPT2TokenizerFast` this is ignored. `GPT2TokenizerFast` instead requires manually adding the `bos` and `eos` tokens to the text strings before tokenizing and numericalizing. This has caused some very sneaky bugs to crop up, because this `add_special_tokens` behavior is not documented anywhere.
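To make the current workaround explicit, this is roughly what the code has to look like today (the text is just a placeholder string):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
text = "hello world"

# Splice the special tokens into the string by hand before tokenizing
ids = tok(tok.bos_token + text + tok.eos_token, add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['<|endoftext|>', 'hello', 'Ġworld', '<|endoftext|>']
```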
One thing I’ve done is create a `RobertaTokenizerFast` from the `vocab.json` and `merges.txt` of a saved `GPT2TokenizerFast`. The Roberta-type tokenizer correctly adds the `bos` and `eos` tokens when `add_special_tokens=True`, but this feels hacky and confusing when we use this tokenizer with GPT2 models.
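For reference, this is roughly how I built it (the paths point at a save_pretrained() dump of the GPT-2 tokenizer and are just illustrative; all the special tokens are pinned to <|endoftext|> so nothing outside the GPT-2 vocab gets introduced):

```python
from transformers import RobertaTokenizerFast

roberta_tok = RobertaTokenizerFast(
    vocab_file="gpt2_tok/vocab.json",
    merges_file="gpt2_tok/merges.txt",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    cls_token="<|endoftext|>",
    sep_token="<|endoftext|>",
    unk_token="<|endoftext|>",
    pad_token="<|endoftext|>",
    mask_token="<|endoftext|>",
)

ids = roberta_tok("hello world", add_special_tokens=True)["input_ids"]
print(roberta_tok.convert_ids_to_tokens(ids))
# ['<|endoftext|>', 'hello', 'Ġworld', '<|endoftext|>']
```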
I’m looking for a way to get `GPT2TokenizerFast` to add the `bos` and `eos` tokens when `add_special_tokens=True`, which is the standard behavior of other tokenizers in the library.
I’m running into the same issue here. Is there a way to automatically add the `bos_token` and `eos_token`?