BPE tokenizers and spaces before words

boris · July 25, 2020, 8:16pm

Hi,

The documentation for GPT2Tokenizer suggests that we should keep the default of not adding spaces before words (add_prefix_space=False).

I understand that GPT2 was trained without adding spaces at the start of sentences, which results in different tokenizations.

However, I imagine that most of the text was similar to:

<|endoftext|>document_1<|endoftext|>document_2...

where document_n could be:

This is a long article from wikipedia. Lots of sentences.

So most of the time, new sentences would actually start with a space (separation from previous sentence) or a line break. I’m not aware of extra preprocessing that would remove spaces after punctuation?

In that case, it is not obvious what the best strategy is when fine-tuning (adding spaces before words or not), as we may want to replicate what was most common in the initial dataset.

I would love any comment!

thomwolf · July 29, 2020, 9:01am

Hi Boris, here is some context and history on the GPT2 and Roberta tokenizers:

In GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can see the spaces included in the words as Ġ here. Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting spaces, since the standard BPE algorithm used spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).

You have probably noticed that the first word is a bit different because it lacks the leading space, but the model was actually trained like this and reaches its best performance this way, with a special first word (see https://github.com/huggingface/transformers/issues/3788).
However, this behavior is a bit strange to some users because the first word is then different from the others: encoding "Cats are super coolio" and "super coolio" will not tokenize "super" the same way (see here for instance: https://github.com/huggingface/transformers/issues/5249).
transformers thus provides an add_prefix_space argument to automatically add a space at the beginning if none is provided (more intuitive tokenization but slightly lower performance).
The library used to have a complex mechanism to disable this when special tokens are used and control it dynamically. This mechanism was error-prone, and this behavior is now simply activated or not at instantiation of the tokenizer (i.e. as an argument in from_pretrained).
Also note that adding a prefix space is necessary when the tokenizer is used with pre-tokenized inputs (is_pretokenized=True): the library has a check that raises an error if you try to encode such input with add_prefix_space=False: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_gpt2.py#L364
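
To make the Ġ convention concrete, here is a minimal pure-Python sketch of the byte-level step that runs before BPE merging. It is not the library's code: the splitting regex is a simplified stand-in for the GPT-2 pattern (stdlib re, no contraction handling), pre_tokenize is a hypothetical helper, and no BPE merges are applied, so the output is whole words rather than final tokens (a rare word like "puppetter" would stay in one piece here).

```python
import re

def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes (including the space, byte 32) are shifted past 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

BYTE_TO_CHAR = bytes_to_unicode()

# Assumption: simplified approximation of the GPT-2 splitting regex
PRETOKEN_RE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")

def pre_tokenize(text, add_prefix_space=False):
    """Hypothetical helper: split into words, keeping the leading space as Ġ."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return ["".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))
            for piece in PRETOKEN_RE.findall(text)]

print(pre_tokenize("Cats are super coolio"))
# ['Cats', 'Ġare', 'Ġsuper', 'Ġcoolio']
print(pre_tokenize("super coolio"))
# ['super', 'Ġcoolio']
print(pre_tokenize("super coolio", add_prefix_space=True))
# ['Ġsuper', 'Ġcoolio']
```

The second and third calls show the trade-off: without the prefix space, "super" at the start of the input is a different piece from "Ġsuper" in the middle of a sentence.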

boris · July 30, 2020, 9:18pm

Thanks so much for taking the time to reply! Here are the results from my tests.

I guess that the results are better without a space mainly because that is the way GPT-2 was trained. Intuitively I would think it helpful for the model to know that “think” and " think" are directly related (we could even go further with capitalized versions, etc).

Something that surprised me is that, in the original training, if sentences are separated by new lines or just spaces, the tokenization will be very different (not just a new line token).
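
A rough way to see this, assuming a simplified stand-in for the GPT-2 splitting regex and showing only the space and newline markers (word_pieces is a hypothetical helper, and no BPE merges are applied):

```python
import re

# Assumption: simplified approximation of GPT-2's splitting regex, not the library's pattern
SPLIT_RE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")

# How byte-level BPE displays a space and a newline
VISIBLE = {" ": "Ġ", "\n": "Ċ"}

def word_pieces(text):
    """Split text into word-level pieces with whitespace made visible."""
    return ["".join(VISIBLE.get(ch, ch) for ch in piece)
            for piece in SPLIT_RE.findall(text)]

print(word_pieces("First one. Second one."))
# ['First', 'Ġone', '.', 'ĠSecond', 'Ġone', '.']
print(word_pieces("First one.\nSecond one."))
# ['First', 'Ġone', '.', 'Ċ', 'Second', 'Ġone', '.']
```

After a space the next word is "ĠSecond", but after a newline it is "Second" preceded by a separate "Ċ" piece, so the word itself changes and not only the separator.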

I tested it and I’m also getting better results without adding extra space when fine-tuning on small tweets.

If we consider the 2 possible scenarios:

Scenario 1: training and prediction with <|endoftext|>token_without_space
Scenario 2: training and prediction with <|endoftext|> token_with_space

My intuition on the difference between these 2 scenarios is that the model will pull samples in probability from similar sequences.

There must be more samples of “Scenario 1”, as I imagine most documents don’t start with a space, which is why we get better results.

We could also remove the <|endoftext|> token but in my tests we need to keep it as it probably also fulfills a “bos” function, letting the model know that we are starting a sample (learnt during fine-tuning).
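
For illustration, the two scenarios could be built like this. This is only a sketch: build_sample and build_corpus are hypothetical helpers, not part of transformers, and stripping surrounding whitespace from each text is an assumption.

```python
EOS = "<|endoftext|>"  # GPT-2's document separator, used here as a start marker too

def build_sample(text, leading_space=False):
    """Prefix one training sample with EOS, with or without a space after it."""
    body = text.strip()  # assumption: samples are stripped before formatting
    if leading_space:
        body = " " + body  # Scenario 2
    return EOS + body      # Scenario 1 when leading_space is False

def build_corpus(texts, leading_space=False):
    """Concatenate samples so that every sample starts and ends at an EOS."""
    return "".join(build_sample(t, leading_space) for t in texts) + EOS

tweets = ["first tweet", "second tweet"]
print(build_corpus(tweets))
# <|endoftext|>first tweet<|endoftext|>second tweet<|endoftext|>
print(build_corpus(tweets, leading_space=True))
# <|endoftext|> first tweet<|endoftext|> second tweet<|endoftext|>
```

Whichever scenario is used for fine-tuning, the prompt at prediction time should be built the same way.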

I tried to create a new special token but results were much worse, probably because we need more data to learn its function and also because we lose in some way the pretrained knowledge where it was not present.

Let me know if you have any more insight. I’m really looking forward to large model training with a modified tokenizer, that would give as much info as possible at tokenization time to help the model.

Geo · September 7, 2023, 7:59pm

Please @boris, I need your help with the special token <|endoftext|>.
I think that my question, at least for the token part, is relevant to what has been described above.
I would like someone to clarify or disprove the following. I have found a pretrained gpt2 model trained on the Greek language from huggingface, named nikokons/gpt2-greek, and I want to fine-tune it on my custom dataset. My dataset consists of samples of mathematical definitions with related questions written in the Greek language. Let me give some translated examples:

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What are supplementary angles?

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What do we call two angles which sum up to 180 degrees?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What is an isosceles triangle?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What do we call a triangle which has two sides of equal length?

Notice that for a Definition I might have multiple questions in my dataset. I want to fine-tune the model so that it learns to answer the user’s question by replying with the entire Definition related to that question.

What are the steps I should follow?
First fine-tune the model on the raw dataset (I mean the dataset without special tokens) so that it learns the new terminology, and then preprocess the dataset to add the <|endoftext|> token at the beginning and end of each sample and fine-tune the model again on the new preprocessed dataset?

The processed training dataset should be like the following, without starting with a space, as you suggested?

<|endoftext|>A triangle is called isosceles when it has two sides of equal length. What is an isosceles triangle? <|endoftext|>Two angles are called supplementary angles when they sum up to 180 degrees.
What do we call two angles which sum up to 180 degrees?<|endoftext|>

Also, should I use padding=right when tokenizing the samples, or is there no need for that since, from what I have read, gpt2 can handle sequences of various lengths?

If I could find a complete example (including how to process the dataset) for fine-tuning gpt2 on QnA or chat, that would be very helpful. Basically, I don’t know whether the task I described earlier falls under conversational chat or QnA.

boris · September 8, 2023, 5:38pm

Maybe this report can help on how to build a dataset: Weights & Biases
