Based on the CTRL approach applied to GPT-2, I'm trying to add tokens in order to control my text generation style. Is there a difference between adding a token as a regular token and adding it as a special token?
hey @Felipehonorato, as far as I know, special tokens won't be split by the tokenizer, which might be handy in your case since you're trying to incorporate control tokens. You can find more information in the docs here.
Thank you so much for the reply. Just to make sure I understood: when adding a word as a regular token, the tokenizer may split it into subwords, but when adding it as a special token this shouldn't happen, am I right? I've run a couple of tests, and passing them as special tokens worked pretty well.
yes, you're totally right about the difference between regular vs. special tokens! In case you're interested, this is all handled by the tokenizers library, which has some extra details here: Input sequences — tokenizers documentation
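To make the difference concrete, here is a minimal sketch using the tokenizers library directly. It trains a tiny throwaway BPE tokenizer on a toy corpus (so it runs offline, no GPT-2 download needed) and shows that an unseen control token like `<formal>` gets broken apart by the pre-tokenizer until it is registered as a special token. The token name `<formal>` and the toy corpus are just illustrative assumptions, not from the thread.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Tiny BPE tokenizer trained on a toy corpus, purely for illustration.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
tokenizer.train_from_iterator(["a tiny corpus to build a vocab"] * 10, trainer)

# Before registering it, the control token is split up by the
# pre-tokenizer and falls back to subword pieces / [UNK].
before = tokenizer.encode("<formal> a tiny corpus").tokens

# After registering it as a special token, it is matched before
# pre-tokenization and always kept as one unit.
tokenizer.add_special_tokens(["<formal>"])
after = tokenizer.encode("<formal> a tiny corpus").tokens

print(before)
print(after)
```

With a `transformers` tokenizer the idea is the same: add your control tokens via `add_special_tokens` (and remember to resize the model's embeddings afterwards with `model.resize_token_embeddings(len(tokenizer))`).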