Regular tokens vs special tokens

Based on the CTRL approach with GPT-2, I'm trying to add tokens in order to control the style of my text generation. Is there a difference between adding a token as a regular token and adding it as a special token?

Hey @Felipehonorato, as far as I know, special tokens won't be split by the tokenizer, which might be handy in your case since you're trying to incorporate control tokens. You can find more information in the docs here.
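In case it helps, here's a minimal sketch of how that could look with `transformers` (the `gpt2` checkpoint and the control token strings are just placeholders for your setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# hypothetical control tokens for steering generation style
control_tokens = ["<formal>", "<informal>"]

# register them as special tokens so the tokenizer keeps them as single units
tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})

# the embedding matrix needs a row for each new token id
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("<formal> Dear Sir or Madam,"))
```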

Thank you very much for the reply. Just to make sure I understood: when adding a word as a regular token, the tokenizer may still split it into subwords, whereas when adding it as a special token this shouldn't happen. Am I right? I've run a couple of tests and adding them as special tokens worked pretty well.

Yes, you're totally right about the difference between regular and special tokens! In case you're interested, this is all handled by the tokenizers library, which has some extra details here: Input sequences — tokenizers documentation
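If you want to double-check the behaviour on your own tokenizer, a quick comparison along these lines (the token string is just an example) will show how each variant gets tokenized:

```python
from transformers import AutoTokenizer

text = "<STYLE_POETRY> once upon a time"

# add the token with add_tokens (regular added token)
tok_regular = AutoTokenizer.from_pretrained("gpt2")
tok_regular.add_tokens(["<STYLE_POETRY>"])
print(tok_regular.tokenize(text))

# add the token with add_special_tokens (special token)
tok_special = AutoTokenizer.from_pretrained("gpt2")
tok_special.add_special_tokens({"additional_special_tokens": ["<STYLE_POETRY>"]})
print(tok_special.tokenize(text))
```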


@lewtun
If I have a taxonomy of text with levels T1, T2, and T3, and I want to establish a context from T1 to T3, would I need special tokens or normal tokens? For example:
T1 [L2] T2 [L3] T3
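Here's a rough sketch of what I mean, assuming the [L2] and [L3] markers are added to the tokenizer and the taxonomy entries are just placeholder strings:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# level markers placed between the taxonomy levels
tokenizer.add_special_tokens({"additional_special_tokens": ["[L2]", "[L3]"]})

# hypothetical taxonomy entries
t1, t2, t3 = "science", "physics", "quantum mechanics"
sequence = f"{t1} [L2] {t2} [L3] {t3}"

print(tokenizer.tokenize(sequence))
```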

@lewtun Could you help us with this?

Tokenizer.add_tokens automatically converts new ESM-2 tokens to special tokens

We cannot extend the embedding layer by adding new tokens with ESM-2 models.
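For reference, this is roughly what we are trying (the checkpoint name and the new token string are just examples); after `add_tokens` we inspect how the token was registered and then resize the embeddings:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# example ESM-2 checkpoint (placeholder for the one we actually use)
checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# try to add a brand-new (non-special) token
num_added = tokenizer.add_tokens(["<new_residue>"])
print("tokens added:", num_added)

# inspect how the tokenizer registered the new token
print(tokenizer.get_added_vocab())
print(tokenizer.all_special_tokens)

# grow the embedding matrix to cover the new token id
model.resize_token_embeddings(len(tokenizer))
```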