When using GPT2Tokenizer, I need to add an additional special token, let's say <|special|>. But when I add it via 'additional_special_tokens', I get unexpected results.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(direction)  # direction: path to my local GPT-2 tokenizer files
tokenizer.add_special_tokens({'additional_special_tokens': ['<|special|>']})
text_expected = '<|endoftext|> bye'
tokenizer(text_expected, return_tensors='pt').input_ids
-> tensor([[50256, 33847]])
text_unexpected = '<|special|> bye'
tokenizer(text_unexpected, return_tensors='pt').input_ids
-> tensor([[50257, 16390]])
The id of 'bye' changed: it seems the tokenizer treats <|endoftext|> as part of the sentence, but not <|special|>? Is this expected behaviour? Thanks for any explanations.
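For reference, here is a minimal way to inspect the token strings instead of the ids (a sketch assuming the standard 'gpt2' checkpoint rather than my local path). My guess, based on the two ids above, is that the leading space before 'bye' is kept after <|endoftext|> but stripped after the newly added token:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'additional_special_tokens': ['<|special|>']})

# Compare the raw token strings to see where the space goes.
print(tokenizer.tokenize('<|endoftext|> bye'))  # I would expect something like ['<|endoftext|>', 'Ġbye']
print(tokenizer.tokenize('<|special|> bye'))    # I would expect something like ['<|special|>', 'bye']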