I am a new PhD student working on NLP-related tasks, and it’s been a couple of months since I started using this awesome library.
I’ve always used BERT-based models until now and everything has worked fine, but now I want to try something else, like GPT or BART. However, I am facing a problem in the data preprocessing step.
Briefly, I obtain different tokenizations for the same word/entity depending on whether I feed it to the pretrained tokenizer alone or as part of a sentence. For example, I have the following sentence:
‘Merpati flight 106 departed Jakarta ( CGK ) on a domestic flight to Tanjung Pandan ( TJQ ).’
which, tokenized with the GPT pretrained tokenizer, becomes:
[13102, 79, 7246, 5474, 15696, 24057, 49251, 357, 29925, 42, 1267, 319, 257, 5928, 5474, 284, 11818, 73, 2150, 16492, 272, 357, 41852, 48, 1267, 764]
Now consider the entity ‘TJQ’, and say that I want to find its span in the new tokenization (i.e. its position in the list above). I usually run the tokenizer on ‘TJQ’ alone with add_special_tokens=False and then search the full sentence’s token list for the resulting sublist (this method is pretty ugly, but it’s the first thing that came to mind; suggestions on how to improve it are welcome). This has always worked with BERT, but it gives me trouble with other models. For example, when tokenizing ‘TJQ’ alone with the GPT pretrained tokenizer I get:
[51, 41, 48]
which doesn’t correspond to any sublist of the original list.
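For reference, here is a minimal sketch of the sublist search I describe above, run on the exact token IDs from this post. The helper name find_sublist is just something I made up for illustration; it is not part of the library.

```python
def find_sublist(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1."""
    n = len(needle)
    if n == 0:
        return -1
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


# Token IDs copied from the post: the full sentence, then 'TJQ' on its own.
sentence_ids = [13102, 79, 7246, 5474, 15696, 24057, 49251, 357, 29925, 42,
                1267, 319, 257, 5928, 5474, 284, 11818, 73, 2150, 16492, 272,
                357, 41852, 48, 1267, 764]
entity_ids = [51, 41, 48]

# The standalone tokenization is not a sublist of the sentence tokenization.
print(find_sublist(sentence_ids, entity_ids))  # -1

# Judging by position (between the two parenthesis tokens 357 and 1267),
# the entity seems to be encoded as these two IDs inside the sentence.
print(find_sublist(sentence_ids, [41852, 48]))  # 22
```

So within the sentence the entity appears to occupy two tokens, while on its own it is split into three, and only the last ID (48) is shared between the two.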
First, I was wondering why the tokenization differs, and why I haven’t encountered the same problem with BERT.
Second, is there a better way to find the span of a specific piece of a sentence in the tokenized sentence?
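One direction I have been considering is to avoid re-tokenizing the entity at all and instead map its character span in the original string onto token indices. The sketch below only shows that mapping step. It uses a toy whitespace tokenizer as a stand-in so that it runs on its own; the assumption is that the real tokenizer can give me per-token character offsets (as far as I understand, the fast tokenizers can do this with return_offsets_mapping=True, but I have not verified that it behaves as I expect for GPT). The helper name token_span_for_chars is hypothetical.

```python
import re


def token_span_for_chars(offsets, char_start, char_end):
    """Map a character span [char_start, char_end) to a token span.

    offsets is a list of (start, end) character offsets, one per token.
    Returns (first_token, last_token_exclusive), or None if nothing overlaps.
    """
    hits = [i for i, (s, e) in enumerate(offsets)
            if s < char_end and e > char_start]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


text = ("Merpati flight 106 departed Jakarta ( CGK ) on a domestic flight "
        "to Tanjung Pandan ( TJQ ).")

# Toy stand-in for a real tokenizer: one token per whitespace-separated chunk.
# With a real tokenizer, these offsets would come from the tokenizer itself.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
tokens = [text[s:e] for s, e in offsets]

# Locate the entity in the raw string, then translate to token indices.
char_start = text.index("TJQ")
char_end = char_start + len("TJQ")
span = token_span_for_chars(offsets, char_start, char_end)

print(span)                      # (16, 17)
print(tokens[span[0]:span[1]])   # ['TJQ']
```

The appeal of this approach is that it never compares token IDs, so it should not matter whether the entity is tokenized differently alone and in context.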
Thanks in advance for the help!