BPE tokenizers and spaces before words

Hi Boris, here is some context and history on the GPT2 and Roberta tokenizers:

In the GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can notice the spaces included in the words as Ġ here. Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid the algorithm digesting raw spaces, since the standard BPE algorithm uses spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).
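
Here is a minimal sketch of what this looks like in practice, assuming the transformers library is installed and the "gpt2" checkpoint is available; the exact token strings can vary slightly between versions:

```python
from transformers import GPT2Tokenizer

# Load the pretrained GPT2 tokenizer (downloads the vocab/merges files on first use).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Spaces show up as the special Ġ character at the start of tokens.
print(tokenizer.tokenize("Hello how are you puppetter"))
# e.g. ['Hello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter']
```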

  • You have probably noticed that the first word is a bit different because it is lacking the leading space, but the model is actually trained like this and reaches its best performance this way, with a special first word (see https://github.com/huggingface/transformers/issues/3788)
  • However, this behavior is a bit strange to some users because the first word is then treated differently from the others: encoding "Cats are super coolio" and "super coolio" will not give the same tokenization for the shared words (see here for instance: https://github.com/huggingface/transformers/issues/5249); the sketch after this list illustrates this.
  • transformers thus provides an add_prefix_space argument to automatically add a space at the beginning if none is provided (more intuitive tokenization, but slightly lower performance).
  • The library used to have a complex mechanism to disable this when special tokens were used and to control it dynamically. That mechanism was error-prone, so the behavior is now simply enabled or not at instantiation of the tokenizer (i.e. as an argument to from_pretrained).
  • Also note that adding a prefix space is necessary when the tokenizer is used with pre-tokenized inputs (is_pretokenized=True): the library has a check that raises an error if you try to encode such input with add_prefix_space=False: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_gpt2.py#L364
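
Below is a small sketch comparing the default behavior with add_prefix_space=True, again assuming the "gpt2" checkpoint; the exact sub-word splits may differ slightly across versions:

```python
from transformers import GPT2Tokenizer

default_tok = GPT2Tokenizer.from_pretrained("gpt2")
prefix_tok = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

# With the default settings, "super" inside the sentence gets the leading-space
# form 'Ġsuper', while the standalone phrase starts with plain 'super',
# so the two encodings of "super coolio" do not match.
print(default_tok.tokenize("Cats are super coolio"))
print(default_tok.tokenize("super coolio"))

# With add_prefix_space=True a space is prepended automatically, so the
# standalone phrase is tokenized consistently with its in-sentence form.
print(prefix_tok.tokenize("super coolio"))
# e.g. ['Ġsuper', 'Ġcool', 'io']
```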