Hi all, I was inspecting the tokenization output of GPT2Tokenizer
and noticed that it tokenizes \n\n
as a single token in some cases, whereas in other cases it tokenizes it as two separate \n
tokens. Can someone explain why the tokenizations differ?
Here is a quick example:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer(" questions.\n\n Human")['input_ids']
>>> [2683, 13, 198, 198, 20490] # [' questions', '.', '\n', '\n', 'Human'] if followed by a space, will tokenize as 2 separate '\n'
tokenizer(" questions.\n\nHuman")['input_ids']
>>> [2683, 13, 628, 5524] # [' questions', '.', '\n\n', ' Human']
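For reference, converting the ids back to tokens makes the difference easier to see ('Ġ' marks a leading space and 'Ċ' a newline in GPT2's byte-level alphabet):
tokenizer.convert_ids_to_tokens([2683, 13, 628, 5524])
>>> ['Ġquestions', '.', 'ĊĊ', 'ĠHuman']
tokenizer.convert_ids_to_tokens([2683, 13, 198, 198, 20490])
>>> ['Ġquestions', '.', 'Ċ', 'Ċ', 'Human']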
The reason “\n\n” might be tokenized differently in varying contexts comes down to a couple of factors:
Whitespace Handling: Tokenizers are often sensitive to whitespace. In your example, when a space follows the “\n\n”, the two newlines are kept together as one token; when a letter follows directly, the tokenizer handles each “\n” separately (see the quick check after this list).
Context: The tokenizer may take into account the surrounding characters when determining how to tokenize a substring. It’s a way to better represent the syntactic and semantic properties of the text.
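To illustrate the first point, the same behavior shows up with plain spaces, not just newlines. A quick check (assuming the gpt2 checkpoint loaded above):
tokenizer.tokenize("a  b")  # 'a', two spaces, 'b'
>>> ['a', 'Ġ', 'Ġb']  # one space stays alone, the other attaches to 'b' ('Ġ' marks a space)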
Thank you for the response.
To follow up, I would like to understand how context such as whitespace comes into play when the GPT2 tokenizer tokenizes a string.
For example, it would be great to see the step-by-step creation of the token list for the simple example I provided.
Do you mean like this?
Starting Point: The tokenizer sees the string " questions.\n\n Human".
Tokenizing " questions": Recognized as the single token [2683].
Tokenizing '.': Next is a period, recognized as [13].
Tokenizing '\n\n' with a following space: Because the character right after '\n\n' is itself whitespace (the space before "Human"), the two newlines stay together and are encoded as the single token [628]. In your second string, where "H" follows '\n\n' directly, the last newline is split off first, which is why you get the two tokens [198, 198] there.
Tokenizing " Human": The space attaches to the following word, so " Human" is recognized as the single token [5524].
If you want to do this programmatically, I suppose that would be an interesting side project; a rough sketch is below.
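Here is a minimal sketch of what that could look like. Note that pat, byte_encoder, and bpe are internal attributes of the slow GPT2Tokenizer in transformers, not public API, so treat this as an assumption that may break across versions:
# step 1: pre-tokenization, splitting the raw string with the tokenizer's own regex
text = " questions.\n\n Human"
words = tokenizer.pat.findall(text)
print(words)  # [' questions', '.', '\n\n', ' Human']
# step 2: map each piece to GPT2's byte-level symbols, then apply the BPE merges
for w in words:
    symbols = "".join(tokenizer.byte_encoder[b] for b in w.encode("utf-8"))
    print(repr(w), "->", tokenizer.bpe(symbols).split(" "))
# ' questions' -> ['Ġquestions'], '.' -> ['.'], '\n\n' -> ['ĊĊ'], ' Human' -> ['ĠHuman']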
Hi, thanks again for the response, but this is just stating the obvious…
I am interested in understanding the rules of the tokenization, so that if I had such rules in hand, it would be clear why \n\n
is tokenized differently depending on whether a space follows it. It would be great to see the tokenization rules for the GPT2 tokenizer with a given vocabulary, explained in plain words.
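For example, I can already see that both the single and the double newline exist in the vocabulary, so the choice between them must be made somewhere before the vocabulary lookup:
tokenizer.convert_tokens_to_ids(['Ċ', 'ĊĊ'])
>>> [198, 628]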