`GPT2Tokenizer` handling `\n\n` differently in different settings

Hi all, I was inspecting the tokenization output of `GPT2Tokenizer` and observed that it tokenizes `\n\n` as a single token in some cases, whereas in other cases it tokenizes it as two separate `\n` tokens. Can someone explain why the tokenizations differ?

Here is a quick example:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer(" questions.\n\n Human")['input_ids']
>>> [2683, 13, 628, 5524]        # [' questions', '.', '\n\n', ' Human'] -> followed by a space, '\n\n' stays a single token
tokenizer(" questions.\n\nHuman")['input_ids']
>>> [2683, 13, 198, 198, 20490]  # [' questions', '.', '\n', '\n', 'Human'] -> followed directly by a word, two separate '\n'
```
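
For reference, the per-token strings in the comments can be read off with `convert_ids_to_tokens`, which shows the byte-level symbols GPT-2 uses internally ('Ġ' marks a leading space, 'Ċ' a newline):

```python
print(tokenizer.convert_ids_to_tokens([2683, 13, 628, 5524]))
# ['Ġquestions', '.', 'ĊĊ', 'ĠHuman']
print(tokenizer.convert_ids_to_tokens([2683, 13, 198, 198, 20490]))
# ['Ġquestions', '.', 'Ċ', 'Ċ', 'Human']
```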

The reason `\n\n` is tokenized differently in different contexts comes down to a few factors:

Whitespace Handling: Tokenizers are often sensitive to whitespace. In your example, when a space follows the `\n\n`, the tokenizer can hand that space over to the following word and keep the two newlines together as one unit; when a word follows directly, it ends up handling each `\n` separately (see the sketch below these points).

Context: The tokenizer takes the surrounding characters into account when it first splits the text into chunks, before any merges are applied, so the same substring can be carved up differently depending on what comes after it.
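
To see the first point concretely, the chunking comes from GPT-2's pre-tokenization regex. Here is a small sketch using the pattern published in OpenAI's GPT-2 `encoder.py` (it needs the third-party `regex` package, since the pattern uses `\p{...}` classes):

```python
import regex  # pip install regex

# Pre-tokenization pattern from OpenAI's GPT-2 encoder.py
pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(pat.findall(" questions.\n\n Human"))
# [' questions', '.', '\n\n', ' Human']    -> the space attaches to ' Human', so '\n\n' survives intact
print(pat.findall(" questions.\n\nHuman"))
# [' questions', '.', '\n', '\n', 'Human'] -> nothing to hand over, so the newlines land in separate chunks
```

The BPE merges then run inside each chunk independently, which is why the two newlines can only merge into the single `\n\n` token (628) when they end up in the same chunk.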

Thank you for the response :slight_smile: To follow up, I would like to understand how context such as whitespace comes into play when the GPT-2 tokenizer tokenizes a string.

For example, it would be great to see the step-by-step creation of the token list for the simple example I provided.

Do you mean like this?

Starting Point: The tokenizer sees the string " questions.\n\n Human".

Tokenizing " questions": Recognizes " questions" (leading space included) as a single token, [2683].

Tokenizing '.': Next is a period, recognized as [13].

Tokenizing '\n\n' with a following space: The space gets attached to the next word rather than to the newlines, so the two newlines stay together and map to the single token [628].

Tokenizing " Human": The last word " Human" (space included) is recognized as one token, [5524].

If you want to do this programmatically, I suppose that would be an interesting side project.
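
As a starting point, something like this replays the walkthrough through the public API; decoding each id individually shows exactly which substring it covers:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for text in (" questions.\n\n Human", " questions.\n\nHuman"):
    ids = tokenizer(text)["input_ids"]
    # Decode each id separately to see the exact substring it stands for.
    pieces = [tokenizer.decode([i]) for i in ids]
    print(repr(text), ids, pieces)
```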

Hi, thanks again for the response, but this is just restating the obvious…

I am interested in understanding the rules of the tokenization: if I had such rules in hand, it would be clear why `\n\n` is tokenized differently depending on whether a space follows it. It would be great to see the tokenization rules of the GPT-2 tokenizer, for a given vocabulary, explained in plain words.
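
To make concrete what I mean: both the single and the double newline exist in the vocabulary, so the vocabulary alone cannot explain the difference; the rule I am after must live in how the input is split before the merges are applied. Here is what I can observe but not yet derive (`get_vocab` is the standard `transformers` accessor, and 'Ċ' is GPT-2's byte-level symbol for `\n`):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()

# Both newline variants exist as single vocabulary entries.
print(vocab["Ċ"], vocab["ĊĊ"])
# 198 628
```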