Hi Boris, here is some context and history on the GPT2 and Roberta tokenizers:
In the GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can notice the spaces included in the words as Ġ here. Spaces are converted to a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting spaces, since the standard BPE algorithm uses spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).
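As a quick illustration, here is a minimal sketch using the GPT2 tokenizer from `transformers` (the exact sub-word splits depend on the vocabulary; the point is the Ġ marker on every word except the first):

```python
from transformers import GPT2Tokenizer

# Byte-level BPE tokenizer of GPT2; spaces are mapped to the special Ġ character
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("Hello how are you puppetter"))
# -> ['Hello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter']
```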
- You probably have noted that the first word is a bit different because it's lacking the first space, but the model is actually trained like this and reaches its best performance like this, with a special first word (see https://github.com/huggingface/transformers/issues/3788).
- However, this behavior is a bit strange to some users because the first word is then different from the others: encoding `Cats are super coolio` and `super coolio` will not give the same tokenization (see here for instance: https://github.com/huggingface/transformers/issues/5249). `transformers` thus provides an `add_prefix_space` argument to automatically add a space at the beginning if none is provided (more intuitive tokenization, but slightly lower performance); see the first sketch after this list.
- The library used to have a complex mechanism to disable this when special tokens are used and to control it dynamically. This mechanism was error-prone, and the behavior is now simply activated or not when the tokenizer is instantiated (i.e. as an argument in `from_pretrained`).
- Also note that adding a prefix space is necessary when the tokenizer is used with pre-tokenized inputs (`is_pretokenized=True`); the library has a check that raises an error if you try to encode such inputs with `add_prefix_space=False` (see the second sketch after this list): https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_gpt2.py#L364
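Here is the first sketch, comparing the default behavior with `add_prefix_space=True` (assuming the `"gpt2"` pretrained checkpoint; the exact sub-word splits are vocabulary-dependent, the point is the Ġ on the first word):

```python
from transformers import GPT2Tokenizer

default_tok = GPT2Tokenizer.from_pretrained("gpt2")
prefix_tok = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

# "super" gets a leading Ġ only when it follows a space...
print(default_tok.tokenize("Cats are super coolio"))
# ...so the first word of an input is tokenized differently by default
print(default_tok.tokenize("super coolio"))   # first token has no leading Ġ

# With add_prefix_space=True a space is prepended, so the first word looks like the others
print(prefix_tok.tokenize("super coolio"))    # first token is 'Ġsuper'
```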
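And the second sketch, for pre-tokenized inputs: since the pre-split words carry no leading spaces, the tokenizer is instantiated with `add_prefix_space=True` as required above. This is a minimal sketch assuming a `transformers` version where the encoding call accepts the `is_pretokenized` flag mentioned above (later versions renamed it, so the exact keyword may differ):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)

# Input already split into words: the tokenizer adds the prefix space itself
encoding = tokenizer(["Cats", "are", "super", "coolio"], is_pretokenized=True)
print(encoding["input_ids"])
```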