How to customize the behavior of added special tokens in a pretrained tokenizer?

Hi, I’m working on a project where I need to place a custom token at certain locations in a text. I read about how to add special tokens in this GitHub issue, and I’m wondering whether the special token will then be inserted automatically into every text I tokenize. How should I actually go about adding this token to the tokenizer?

Here’s an example to better illustrate what I mean. Let’s say the special token is [D]. If I have sentence A and sentence B, I would want to create the input_ids for the whole thing as “A [D] B”. What should be the input to the tokenizer to make sure the output matches this?
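
To make the setup concrete, here is a minimal sketch of what I have in mind. It uses the standalone `tokenizers` library with a toy word-level vocabulary so that it runs without downloading anything; the vocabulary and the example sentences are made up for illustration. With a pretrained `transformers` tokenizer, the analogous calls would presumably be `tokenizer.add_special_tokens({"additional_special_tokens": ["[D]"]})` followed by `model.resize_token_embeddings(len(tokenizer))`, but I have not verified that this covers every model.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary, assumed for illustration only; a pretrained tokenizer ships its own.
vocab = {"[UNK]": 0, "first": 1, "sentence": 2, "second": 3}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Register [D] as a special token so it is matched as one unit
# instead of being split into "[", "D", "]" by the pre-tokenizer.
tokenizer.add_special_tokens(["[D]"])
d_id = tokenizer.token_to_id("[D]")

# The marker is written into the input text by hand: "A [D] B".
with_marker = tokenizer.encode("first sentence [D] second sentence")
# Text that does not contain the marker is tokenized without it.
without_marker = tokenizer.encode("first sentence second sentence")

print(with_marker.tokens)
print(with_marker.ids)
print(d_id in without_marker.ids)
```

In this sketch the token only shows up where the input string contains it. Registering it does not cause it to be inserted into other texts.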

Another question: it seems that periods are usually automatically replaced with the <s> special token. Is this a feature designed specifically for periods, or can something similar be done with other strings?
