Hi Boris, here is some context and history on the GPT2 and Roberta tokenizers:
In the GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can notice the spaces included in the words as Ġ here. Spaces are converted to a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting spaces, since the standard BPE algorithm uses spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).
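As a quick illustration, here is a minimal sketch using the GPT2 tokenizer from `transformers` (the exact sub-word splits depend on the vocabulary; the point is the Ġ marker on every word except the first):

```python
from transformers import GPT2Tokenizer

# Byte-level BPE tokenizer of GPT2; spaces are mapped to the special Ġ character
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("Hello how are you puppetter"))
# -> ['Hello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter']
```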
- You probably have noted that the first word is a bit different because it's lacking the first space, but the model is actually trained like this and reaches its best performance like this, with a special first word (see https://github.com/huggingface/transformers/issues/3788).
- However, this behavior is a bit strange to some users because the first word is then different from the others: encoding `Cats are super coolio` and `super coolio` will not give the same tokenization (see here for instance: https://github.com/huggingface/transformers/issues/5249). `transformers` thus provides an `add_prefix_space` argument to automatically add a space at the beginning if none is provided (more intuitive tokenization, but slightly lower performance); see the first sketch after this list.
- The library used to have a complex mechanism to disable this when special tokens are used and to control it dynamically. This mechanism was error-prone, and the behavior is now simply activated or not when the tokenizer is instantiated (i.e. as an argument in `from_pretrained`).
- Also note that adding a prefix space is necessary when the tokenizer is used with pre-tokenized inputs (`is_pretokenized=True`); the library has a check that raises an error if you try to encode such inputs with `add_prefix_space=False` (see the second sketch after this list): https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_gpt2.py#L364
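Here is the first sketch, comparing the default behavior with `add_prefix_space=True` (assuming the `"gpt2"` pretrained checkpoint; the exact sub-word splits are vocabulary-dependent, the point is the Ġ on the first word):

```python
from transformers import GPT2Tokenizer

default_tok = GPT2Tokenizer.from_pretrained("gpt2")
prefix_tok = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

# "super" gets a leading Ġ only when it follows a space...
print(default_tok.tokenize("Cats are super coolio"))
# ...so the first word of an input is tokenized differently by default
print(default_tok.tokenize("super coolio"))   # first token has no leading Ġ

# With add_prefix_space=True a space is prepended, so the first word looks like the others
print(prefix_tok.tokenize("super coolio"))    # first token is 'Ġsuper'
```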
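And the second sketch, for pre-tokenized inputs: since the pre-split words carry no leading spaces, the tokenizer is instantiated with `add_prefix_space=True` as required above. This is a minimal sketch assuming a `transformers` version where the encoding call accepts the `is_pretokenized` flag mentioned above (later versions renamed it, so the exact keyword may differ):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

# Input already split into words: the tokenizer adds the prefix space itself
encoding = tokenizer(["Cats", "are", "super", "coolio"], is_pretokenized=True)
print(encoding["input_ids"])
```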