I’m wondering how to properly use `PreTrainedTokenizerBase.build_inputs_with_special_tokens`.
According to the following example:

```python
from transformers import GPT2Tokenizer

# make sure GPT2 prepends BOS and appends EOS
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs

GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
```
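If I understand the example correctly, a quick check like this should confirm the patch on the slow tokenizer (just a sketch; `'gpt2'` is the stock checkpoint, and GPT-2 happens to use the same id for both special tokens):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# add_special_tokens defaults to True, so the patched method should run here
ids = tokenizer('hello world')['input_ids']
print(ids[0] == tokenizer.bos_token_id)   # expected: True
print(ids[-1] == tokenizer.eos_token_id)  # expected: True
```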
it seems we can simply override the default method, which by default adds nothing.
But when I tried doing the same in my own use case:

```python
from transformers import PreTrainedTokenizerFast

trained_tokenizer = PreTrainedTokenizerFast(tokenizer_file='tokenizer.json')
trained_tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
```
the tokenized output does not include `bos` and `eos`.
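Concretely, something like this is what I see (the input string is arbitrary; the exact ids come from my own trained tokenizer):

```python
enc = trained_tokenizer('hello world', add_special_tokens=True)
print(enc['input_ids'])  # just the plain token ids -- no bos/eos, despite the assignment above
```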
Isn’t this function automatically called during tokenization?