BPE tokenizers and spaces before words

Hi,

The documentation for GPT2Tokenizer suggests that we should keep the default of not adding spaces before words (add_prefix_space=False).

I understand that GPT2 was trained without adding spaces at the start of sentences, which results in different tokenizations.

However, I imagine that most of the text was similar to:

<|endoftext|>document_1<|endoftext|>document_2...

where document_n could be:

This is a long article from wikipedia. Lots of sentences.

So most of the time, new sentences would actually start with a space (separation from previous sentence) or a line break. I’m not aware of extra preprocessing that would remove spaces after punctuation?

In that case, it is not obvious what the best strategy is when fine-tuning (adding spaces before words or not), as we may want to replicate what was most common in the initial dataset.

I would love any comment!

Hi Boris, here is some context and history on the GPT2 and Roberta tokenizers:

In the GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized into ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can see the spaces included in the words as Ġ here. Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting spaces, since the standard BPE algorithm used spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).
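
To make the space handling above concrete, here is a minimal sketch of the step that happens before BPE merging. It is not the library's code: the byte table follows the scheme of the original GPT2 implementation, but the regex is an approximation written for the standard `re` module, and `pre_tokenize` is a hypothetical helper name. The behavior of `add_prefix_space` here (prepend a space only if none is present) is an assumption taken from the description in this thread.

```python
import re


def bytes_to_unicode():
    """Map every byte to a printable character, GPT2-style."""
    printable = (
        list(range(0x21, 0x7F))      # '!' .. '~'
        + list(range(0xA1, 0xAD))    # inverted '!' .. not sign
        + list(range(0xAE, 0x100))   # registered sign .. y with diaeresis
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes (including space, byte 32) are shifted past 255.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()

# Approximation of the GPT2 pre-tokenization pattern (assumption: the real one
# uses unicode property classes that the stdlib `re` module does not support).
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def pre_tokenize(text, add_prefix_space=False):
    """Split text into word-level pieces; BPE merges would then run inside each piece."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return [
        "".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))
        for piece in PRETOKENIZE.findall(text)
    ]


print(pre_tokenize("Hello how are you"))
# ['Hello', 'Ġhow', 'Ġare', 'Ġyou']
print(pre_tokenize("Hello how are you", add_prefix_space=True))
# ['ĠHello', 'Ġhow', 'Ġare', 'Ġyou']
```

The space travels with the word that follows it, and byte 32 is rendered as Ġ, so only the very first word of a text can appear without the marker.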

  • You have probably noticed that the first word is a bit different because it lacks the leading space, but the model was actually trained like this and reaches its best performance this way, with a special first word (see https://github.com/huggingface/transformers/issues/3788)
  • However, this behavior seems a bit strange to some users because the first word is then different from the others: encoding "Cats are super coolio" and "super coolio" will not give the same tokenization for the shared words (see here for instance: https://github.com/huggingface/transformers/issues/5249)
  • transformers thus provides an add_prefix_space argument to automatically add a space at the beginning if none is provided (more intuitive tokenization, but slightly lower performance).
  • The library used to have a complex mechanism to disable this when special tokens are used and to control it dynamically. This mechanism was error-prone, and this behavior is now simply activated or not at instantiation of the tokenizer (i.e. as an argument in from_pretrained).
  • Also note that adding a prefix space is necessary when the tokenizer is used with pre-tokenized inputs (is_pretokenized=True): the library has a check that raises an error if you try to encode such input with add_prefix_space=False: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_gpt2.py#L364
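
The first-word difference described in the points above can be reproduced with a small stand-in. This is only a sketch under simplifying assumptions: `split_words` is a hypothetical helper that handles ASCII text, uses a much reduced regex instead of the real pre-tokenization pattern, and stops before any BPE merging.

```python
import re

# Simplified stand-in for the pre-tokenization step (ASCII only, assumption).
WORDS = re.compile(r" ?[A-Za-z]+| ?[^\sA-Za-z]+|\s+")


def split_words(text, add_prefix_space=False):
    """Split into word pieces and show each leading space as the Ġ marker."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return [w.replace(" ", "\u0120") for w in WORDS.findall(text)]


in_sentence = split_words("Cats are super coolio")[-2:]
standalone = split_words("super coolio")
with_prefix = split_words("super coolio", add_prefix_space=True)

print(in_sentence)   # ['Ġsuper', 'Ġcoolio']
print(standalone)    # ['super', 'Ġcoolio']
print(with_prefix)   # ['Ġsuper', 'Ġcoolio']
```

Without the prefix space, "super" at the start of a text is a different piece from "Ġsuper" inside a sentence, so it ends up with different tokens. The same reasoning suggests why pre-tokenized input needs the prefix space: each word is encoded on its own and would otherwise be treated as a first word.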

Thanks so much for taking the time to reply! Here are the results from my tests.

I guess that the results are better without a space mainly because that is the way GPT-2 was trained. Intuitively, I would think it helpful for the model to know that “think” and “ think” are directly related (we could even go further with capitalized versions, etc).

Something that surprised me is that, in the original training, if sentences are separated by new lines or just spaces, the tokenization will be very different (not just a new line token).
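
A small sketch of why the separator matters, using a simplified regex in place of the real pre-tokenization pattern (an assumption, as is the helper name `pre_tokenize`). Spaces are shown as Ġ and line breaks as Ċ, the characters the GPT2 byte table assigns to them.

```python
import re

# Reduced version of the pre-tokenization pattern (no contraction handling).
PATTERN = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")


def pre_tokenize(text):
    """Split text into pieces, marking spaces as Ġ and line breaks as Ċ."""
    pieces = PATTERN.findall(text)
    return [p.replace(" ", "\u0120").replace("\n", "\u010a") for p in pieces]


with_space = pre_tokenize("First sentence. Second one.")
with_newline = pre_tokenize("First sentence.\nSecond one.")

print(with_space)
# ['First', 'Ġsentence', '.', 'ĠSecond', 'Ġone', '.']
print(with_newline)
# ['First', 'Ġsentence', '.', 'Ċ', 'Second', 'Ġone', '.']
```

After a line break the next word has no leading space, so "Second" is encoded like a first word, while after a space it is "ĠSecond". The difference is therefore more than one extra line-break token.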

I tested it and I’m also getting better results without adding an extra space when fine-tuning on small tweets.

If we consider the 2 possible scenarios:

  • Scenario 1: training and prediction with <|endoftext|>token_without_space
  • Scenario 2: training and prediction with <|endoftext|> token_with_space
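
For reference, the two scenarios differ only in the string handed to the tokenizer. A minimal sketch, where `format_sample` is a hypothetical helper and stripping the surrounding whitespace of the sample is my own assumption:

```python
EOS = "<|endoftext|>"


def format_sample(text, leading_space=False):
    """Prefix a sample with the end-of-text token, with or without a space after it."""
    separator = " " if leading_space else ""
    return EOS + separator + text.strip()


scenario_1 = format_sample("token_without_space")
scenario_2 = format_sample("token_with_space", leading_space=True)

print(scenario_1)  # <|endoftext|>token_without_space
print(scenario_2)  # <|endoftext|> token_with_space
```

The same formatting has to be used for the prompt at prediction time, otherwise the first word is tokenized differently from what was seen during fine-tuning.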

My intuition on the difference between these 2 scenarios is that the model draws its predictions, in probability, from similar sequences seen during training.

There must be more samples of “Scenario 1”, as I imagine most documents don’t start with a space, which is why we get better results.

We could also remove the <|endoftext|> token but in my tests we need to keep it as it probably also fulfills a “bos” function, letting the model know that we are starting a sample (learnt during fine-tuning).

I tried to create a new special token but results were much worse, probably because we need more data to learn its function and also because we partly lose the pretrained knowledge, since that token was not present during pretraining.

Let me know if you have any more insight. I’m really looking forward to large model training with a modified tokenizer that would give as much info as possible at tokenization time to help the model.