How to customize behavior of added special tokens in a pretrained tokenizer?

Aceticia · May 5, 2021, 3:04am

Hi, I’m working on a project where I need to place a custom token at certain locations in a text. I read about how to add special tokens in this GitHub issue, and I’m wondering whether the special token will then be added automatically to any text I tokenize. How should I actually go about adding this token to the tokenizer?

Here’s an example to better illustrate what I mean. Let’s say the special token is [D]. If I have sentence A and sentence B, I want the input_ids for the whole thing to correspond to “A [D] B”. What should I pass to the tokenizer to make sure the output matches this?
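For reference, here is a minimal sketch of one way this could look with the transformers library. It makes some assumptions that are not from my actual setup: a toy vocabulary written to a temporary file stands in for a real checkpoint so the snippet runs offline (in practice the tokenizer would come from something like AutoTokenizer.from_pretrained), and sentence_a / sentence_b are placeholder strings. As far as I understand, a token registered this way is not inserted automatically; it is only kept intact when it already appears in the input text, so the string “A [D] B” has to be built by hand.

```python
import os
import tempfile

from transformers import BertTokenizer

# Assumption: a toy vocabulary standing in for a real pretrained checkpoint.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "hello", "world", "good", "morning"]
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# In practice: tokenizer = AutoTokenizer.from_pretrained("<your checkpoint>")
tokenizer = BertTokenizer(vocab_file=vocab_path)

# Register [D] as an additional special token so it is never split or lowercased.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["[D]"]})
d_id = tokenizer.convert_tokens_to_ids("[D]")

# Placeholder sentences; the separator is written into the text explicitly.
sentence_a = "hello world"
sentence_b = "good morning"
encoded = tokenizer(f"{sentence_a} [D] {sentence_b}")

# Map the ids back to tokens to check where [D] ended up.
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)
```

If the tokenizer is paired with a model, the embedding matrix presumably also needs to grow to cover the new id, e.g. with model.resize_token_embeddings(len(tokenizer)), and the new embedding would start out untrained.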

Another question: it seems that periods are usually automatically replaced with the <s> special token. Is this a special feature designed specifically for periods, or can similar things be done with other strings?
