I’m wondering how to properly use `PreTrainedTokenizerBase.build_inputs_with_special_tokens`.
According to the following example:

```python
from transformers import GPT2Tokenizer

# make sure GPT2 prepends BOS and appends EOS
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs

GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
```
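If I understand the example correctly, a quick check like this should confirm the patch on the slow tokenizer (just a sketch; `'gpt2'` is the stock checkpoint, and GPT-2 happens to use the same id for both special tokens):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# add_special_tokens defaults to True, so the patched method should run here
ids = tokenizer('hello world')['input_ids']
print(ids[0] == tokenizer.bos_token_id)   # expected: True
print(ids[-1] == tokenizer.eos_token_id)  # expected: True
```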
it seems we can simply override the default method, which by default adds nothing.
But when I tried doing the same in my own use case:

```python
from transformers import PreTrainedTokenizerFast

trained_tokenizer = PreTrainedTokenizerFast(tokenizer_file='tokenizer.json')
trained_tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
```
the tokenized output does not include `bos` and `eos`.
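Concretely, something like this is what I see (the input string is arbitrary; the exact ids come from my own trained tokenizer):

```python
enc = trained_tokenizer('hello world', add_special_tokens=True)
print(enc['input_ids'])  # just the plain token ids -- no bos/eos, despite the assignment above
```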
Isn’t this function automatically called during tokenization?