Different tokenization for the same word fed alone vs in a sentence

BrunoLiegiBastonLieg · July 6, 2021, 10:29am

Hi all!

I am a new PhD student working on NLP-related tasks, and I started using this awesome library a couple of months ago.
I’ve always used BERT-based models until now and everything has worked fine, but I now want to try something else, like GPT or BART. However, I am running into a problem in the data preprocessing step.
Briefly, I obtain different tokenizations for the same word/entity depending on whether I feed it to the pretrained tokenizer alone or as part of a sentence. For example, I have the following sentence:

‘Merpati flight 106 departed Jakarta ( CGK ) on a domestic flight to Tanjung Pandan ( TJQ ).’

which, tokenized with the pretrained GPT tokenizer, becomes:

[13102, 79, 7246, 5474, 15696, 24057, 49251, 357, 29925, 42, 1267, 319, 257, 5928, 5474, 284, 11818, 73, 2150, 16492, 272, 357, 41852, 48, 1267, 764]

Now consider the entity ‘TJQ’, and say that I want to find its span in the new tokenization scheme (i.e. its position in the list above). I usually run the tokenizer on ‘TJQ’ alone with add_special_tokens=False and then search the full sentence’s token list for the sublist of tokens obtained this way (btw this method is pretty ugly, but it’s the first thing that came to my mind; suggestions on how to improve it are welcome…). This has always worked with BERT, but it gives me trouble with other models. For example, when tokenizing ‘TJQ’ with the pretrained GPT tokenizer I get:

[51, 41, 48]

which doesn’t correspond to any sublist of the sentence’s token list.
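
For reference, the sublist search described above can be sketched in plain Python roughly as follows. The helper name `find_sublist` is hypothetical, and the token IDs are simply copied from this post; no tokenizer is loaded here.

```python
from typing import List, Optional, Tuple


def find_sublist(haystack: List[int], needle: List[int]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first occurrence of needle in haystack.

    end is exclusive. Returns None if needle is empty or not found.
    Hypothetical helper, not part of any library.
    """
    n = len(needle)
    if n == 0:
        return None
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return (start, start + n)
    return None


# Token IDs of the full sentence, copied from the post.
sentence_ids = [13102, 79, 7246, 5474, 15696, 24057, 49251, 357, 29925, 42,
                1267, 319, 257, 5928, 5474, 284, 11818, 73, 2150, 16492,
                272, 357, 41852, 48, 1267, 764]

# Token IDs obtained by tokenizing 'TJQ' on its own, copied from the post.
entity_ids = [51, 41, 48]

print(find_sublist(sentence_ids, entity_ids))  # None: the IDs never appear together
```

The search itself is fine; it fails only because the IDs produced for the isolated word are not the ones used inside the sentence.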
First, I am wondering why the tokenization differs, and why I haven’t encountered the same problem with BERT.
Second, is there a better way to find the span of a specific piece of a sentence in the tokenized sentence list?
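
One possible direction, sketched under assumptions: if the tokenizer can report the character offsets of each token (in transformers, the fast tokenizers accept `return_offsets_mapping=True`), the span could be located from character positions rather than by matching token IDs. The helper `char_span_to_token_span`, the shortened text, and the hand-written offsets below are hypothetical illustrations, not real tokenizer output.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], char_start: int, char_end: int
) -> Optional[Tuple[int, int]]:
    """Map a character span to a token span (end exclusive).

    offsets holds one (start, end) character pair per token, in order.
    A token is selected if it overlaps [char_start, char_end).
    Hypothetical helper, not part of any library.
    """
    hits = [i for i, (s, e) in enumerate(offsets)
            if s < char_end and e > char_start]
    if not hits:
        return None
    return (hits[0], hits[-1] + 1)


text = "to ( TJQ )"

# Hand-written offsets for an assumed split: "to", " (", " TJ", "Q", " )".
# A real run would take these from the tokenizer instead.
offsets = [(0, 2), (2, 4), (4, 7), (7, 8), (8, 10)]

char_start = text.index("TJQ")
char_end = char_start + len("TJQ")

print(char_span_to_token_span(offsets, char_start, char_end))  # (2, 4)
```

Because the lookup goes through character positions in the original sentence, it does not depend on the entity being tokenized the same way in isolation.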

Thanks in advance for the help!
