Regular tokens vs special tokens

Based on the CTRL approach with GPT-2, I'm trying to add tokens in order to control the style of my text generation. Is there a difference between adding a token as a regular token and adding it as a special token?

Hey @Felipehonorato, as far as I know, special tokens won't be split by the tokenizer, which might be handy in your case since you're trying to incorporate control tokens. You can find more information in the docs here.
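In case it helps, here's a minimal sketch of how that could look with `transformers` (the `gpt2` checkpoint and the control token strings are just placeholders for your setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# hypothetical control tokens for steering generation style
control_tokens = ["<formal>", "<informal>"]

# register them as special tokens so the tokenizer keeps them as single units
tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})

# the embedding matrix needs a row for each new token id
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("<formal> Dear Sir or Madam,"))
```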

Thank you very much for the reply. Just to make sure I understood: when adding a word as a regular token, the tokenizer may still split it into subwords, whereas when adding it as a special token this shouldn't happen. Am I right? I've run a couple of tests and adding them as special tokens worked pretty well.

Yes, you're totally right about the difference between regular and special tokens! In case you're interested, this is all handled by the tokenizers library, which has some extra details here: Input sequences — tokenizers documentation
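If you want to double-check the behaviour on your own tokenizer, a quick comparison along these lines (the token string is just an example) will show how each variant gets tokenized:

```python
from transformers import AutoTokenizer

text = "<STYLE_POETRY> once upon a time"

# add the token with add_tokens (regular added token)
tok_regular = AutoTokenizer.from_pretrained("gpt2")
tok_regular.add_tokens(["<STYLE_POETRY>"])
print(tok_regular.tokenize(text))

# add the token with add_special_tokens (special token)
tok_special = AutoTokenizer.from_pretrained("gpt2")
tok_special.add_special_tokens({"additional_special_tokens": ["<STYLE_POETRY>"]})
print(tok_special.tokenize(text))
```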


@lewtun
If I have a taxonomy of text with levels T1, T2, and T3, and I want to establish a context from T1 to T3, would I need special tokens or normal tokens? For example:
T1 [L2] T2 [L3] T3
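Here's a rough sketch of what I mean, assuming the [L2] and [L3] markers are added to the tokenizer and the taxonomy entries are just placeholder strings:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# level markers placed between the taxonomy levels
tokenizer.add_special_tokens({"additional_special_tokens": ["[L2]", "[L3]"]})

# hypothetical taxonomy entries
t1, t2, t3 = "science", "physics", "quantum mechanics"
sequence = f"{t1} [L2] {t2} [L3] {t3}"

print(tokenizer.tokenize(sequence))
```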

@lewtun Could you help us with this?

Tokenizer.add_tokens automatically converts new ESM-2 tokens to special tokens

We cannot extend the embedding layer by adding new tokens with ESM-2 models.
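For reference, this is roughly what we are trying (the checkpoint name and the new token string are just examples); after `add_tokens` we inspect how the token was registered and then resize the embeddings:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# example ESM-2 checkpoint (placeholder for the one we actually use)
checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# try to add a brand-new (non-special) token
num_added = tokenizer.add_tokens(["<new_residue>"])
print("tokens added:", num_added)

# inspect how the tokenizer registered the new token
print(tokenizer.get_added_vocab())
print(tokenizer.all_special_tokens)

# grow the embedding matrix to cover the new token id
model.resize_token_embeddings(len(tokenizer))
```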