I have a tokenizer (the Baichuan2 tokenizer) which has lots of reserved tokens, like below:
'<reserved_7>': 100,
'<reserved_8>': 101,
'<reserved_9>': 102,
'<reserved_10>': 103,
'<reserved_11>': 104,
'<reserved_12>': 105,
'<reserved_13>': 106,
'<reserved_14>': 107,
I want to replace '<reserved_7>' with '<|im_start|>' and '<reserved_8>' with '<|im_end|>'.
What I want is a tokenizer that behaves like this:
tokenizer.encode('<|im_start|>') => 100
I do not want to use add_tokens or add_special_tokens, because that would change the model's embedding size and make fine-tuning less convenient.
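
To make the goal concrete, here is a toy sketch of the remapping I am after. It works on a plain Python dict that stands in for the vocabulary, not on the real tokenizer object, and the names `vocab`, `renames`, and `rename_tokens` are made up for illustration. The point is that the ids and the vocabulary size stay the same, so the embedding size would not need to change.

```python
# Toy stand-in for the tokenizer's vocabulary (a plain dict, NOT the real
# Baichuan2 tokenizer; only a few of the reserved entries are shown).
vocab = {
    "<reserved_7>": 100,
    "<reserved_8>": 101,
    "<reserved_9>": 102,
}

# Hypothetical mapping from each reserved placeholder to its new token string.
renames = {
    "<reserved_7>": "<|im_start|>",
    "<reserved_8>": "<|im_end|>",
}


def rename_tokens(vocab, renames):
    """Return a new vocab in which renamed tokens keep their original ids."""
    return {renames.get(token, token): token_id for token, token_id in vocab.items()}


new_vocab = rename_tokens(vocab, renames)

# The new strings now point at the old ids, and no entry was added or removed.
print(new_vocab["<|im_start|>"], new_vocab["<|im_end|>"], len(new_vocab))
```

How to apply the same renaming inside the actual tokenizer files is the part I am unsure about.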