Hi all, I was inspecting the tokenization output of GPT2Tokenizer
and noticed that it tokenizes \n\n
as a single token in some cases, whereas in other cases it tokenizes it as two separate \n
tokens. Can someone explain why the tokenizations differ?
Here is a quick example:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer(" questions.\n\n Human")['input_ids']
>>> [2683, 13, 198, 198, 20490] # [' questions', '.', '\n', '\n', 'Human'] if followed by a space, will tokenize as 2 separate '\n'
tokenizer(" questions.\n\nHuman")['input_ids']
>>> [2683, 13, 628, 5524] # [' questions', '.', '\n\n', ' Human']
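For reference, converting the ids back to tokens makes the difference easier to see ('Ġ' marks a leading space and 'Ċ' a newline in GPT2's byte-level alphabet):
tokenizer.convert_ids_to_tokens([2683, 13, 628, 5524])
>>> ['Ġquestions', '.', 'ĊĊ', 'ĠHuman']
tokenizer.convert_ids_to_tokens([2683, 13, 198, 198, 20490])
>>> ['Ġquestions', '.', 'Ċ', 'Ċ', 'Human']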
The reason “\n\n” might be tokenized differently in varying contexts comes down to a couple of factors:
Whitespace Handling: Tokenizers are often sensitive to whitespace. In your example, when a space follows the “\n\n”, the two newlines are kept together as one token; when a letter follows directly, the tokenizer handles each “\n” separately (see the quick check after this list).
Context: The tokenizer may take into account the surrounding characters when determining how to tokenize a substring. It’s a way to better represent the syntactic and semantic properties of the text.
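To illustrate the first point, the same behavior shows up with plain spaces, not just newlines. A quick check (assuming the gpt2 checkpoint loaded above):
tokenizer.tokenize("a  b")  # 'a', two spaces, 'b'
>>> ['a', 'Ġ', 'Ġb']  # one space stays alone, the other attaches to 'b' ('Ġ' marks a space)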
Thank you for the response.
To follow up, I would like to understand how context such as whitespace comes into play when the GPT2 tokenizer tokenizes a string.
For example, it would be great to see the step-by-step creation of the token list for the simple example I provided.
Do you mean like this?
Starting Point: The tokenizer sees the string " questions.\n\n Human".
Tokenizing " questions": Recognized as the single token [2683].
Tokenizing '.': Next is a period, recognized as [13].
Tokenizing '\n\n' with a following space: Because the character right after '\n\n' is itself whitespace (the space before "Human"), the two newlines stay together and are encoded as the single token [628]. In your second string, where "H" follows '\n\n' directly, the last newline is split off first, which is why you get the two tokens [198, 198] there.
Tokenizing " Human": The space attaches to the following word, so " Human" is recognized as the single token [5524].
If you want to do this programmatically, I suppose that would be an interesting side project; a rough sketch is below.
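Here is a minimal sketch of what that could look like. Note that pat, byte_encoder, and bpe are internal attributes of the slow GPT2Tokenizer in transformers, not public API, so treat this as an assumption that may break across versions:
# step 1: pre-tokenization, splitting the raw string with the tokenizer's own regex
text = " questions.\n\n Human"
words = tokenizer.pat.findall(text)
print(words)  # [' questions', '.', '\n\n', ' Human']
# step 2: map each piece to GPT2's byte-level symbols, then apply the BPE merges
for w in words:
    symbols = "".join(tokenizer.byte_encoder[b] for b in w.encode("utf-8"))
    print(repr(w), "->", tokenizer.bpe(symbols).split(" "))
# ' questions' -> ['Ġquestions'], '.' -> ['.'], '\n\n' -> ['ĊĊ'], ' Human' -> ['ĠHuman']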
Hi, thanks again for the response, but this is just stating the obvious…
I am interested in understanding the rules of the tokenization, so that if I had such rules in hand, it would be clear why \n\n
is tokenized differently depending on whether a space follows it. It would be great to see the tokenization rules for the GPT2 tokenizer with a given vocabulary, explained in plain words.
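For example, I can already see that both the single and the double newline exist in the vocabulary, so the choice between them must be made somewhere before the vocabulary lookup:
tokenizer.convert_tokens_to_ids(['Ċ', 'ĊĊ'])
>>> [198, 628]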