Tokenization problem

10qwert · May 13, 2022, 8:51am

I am trying to use a pretrained BART model to train a pointer-generator network.
Example input for the task:

source = "remind me to write thank you letters to invited"
target = "[IN:CREATE_REMINDER remind [SL:PERSON_REMINDED me ] to [SL:TODO write thank you letters to invited ] ]"

First I added special tokens to the tokenizer

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# Ontology markers used in the targets (list abbreviated, etc...)
ontologyToken = ["[IN:CREATE_REMINDER", "[SL:PERSON_REMINDED", "]"]
for item in ontologyToken:
    tokenizer.add_tokens(item, special_tokens=True)

Then I tried to tokenize them

sourceToken = tokenizer(source)["input_ids"]
targetToken = tokenizer(target)["input_ids"]
print(sourceToken)
print(targetToken)

Output: (highlighted ones are the special tokens)

sourceToken: [0, 5593, 2028, 162, 7, 3116, 3392, 47, 5430, 7, 4036, 2]
targetToken: [0, **50265**, 5593, 2028, **50266**, 1794, **742**, 560, **50267**, 29631, 3392, 47, 5430, 7, 4036, **742**, **742**, 2]

Since my model is a pointer-generator network, it computes attention over the source and can point to a specific source token to use as the output token. The targetToken sequence therefore has to contain every token present in sourceToken. But clearly not all of the source tokens appear in targetToken, because some of them have been tokenized differently.

In other words, I want my target to be tokenized so that, if all my special tokens are removed, the remaining target tokens are identical to the source tokens.
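That requirement can be written down as a small check. This is only a minimal sketch: it uses the ids from the printed output above, assumes the ontology markers are the three newly added ids plus 742 (the "]" token), and the helper name `strip_markers` is hypothetical. It would misfire if "]" ever appeared in the source text itself.

```python
# Token ids copied from the printed output above.
source_ids = [0, 5593, 2028, 162, 7, 3116, 3392, 47, 5430, 7, 4036, 2]
target_ids = [0, 50265, 5593, 2028, 50266, 1794, 742, 560, 50267, 29631,
              3392, 47, 5430, 7, 4036, 742, 742, 2]

# Assumption: the ontology markers are the three added ids plus "]" (742).
marker_ids = {50265, 50266, 50267, 742}


def strip_markers(ids, markers):
    """Hypothetical helper: drop marker ids and keep the rest in order."""
    return [i for i in ids if i not in markers]


stripped = strip_markers(target_ids, marker_ids)

# Positions where the stripped target disagrees with the source,
# as (position, source id, target id).
mismatches = [
    (pos, s, t)
    for pos, (s, t) in enumerate(zip(source_ids, stripped))
    if s != t
]

print(stripped == source_ids)  # the property I want to hold
print(mismatches)
```

With the ids above the check fails, and the mismatches are exactly the three tokens that directly follow a marker.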

So I decoded them to see what is going on.

print([tokenizer.decode(x) for x in sourceToken])
print([tokenizer.decode(x) for x in targetToken])

Output:

['<s>', 'rem', 'ind', ' me', ' to', ' write', ' thank', ' you', ' letters', ' to', ' invited', '</s>']
['<s>', '[IN:CREATE_REMINDER', 'rem', 'ind', '[SL:PERSON_REMINDED', 'me', ']', 'to', '[SL:TODO', 'write', ' thank', ' you', ' letters', ' to', ' invited', ']', ']', '</s>']

We can see that every word that comes directly after a special token is tokenized differently.
For example, in sourceToken the word “me” is tokenized as " me" with a leading space, but in targetToken it is "me" without one. How do I make targetToken tokenize these words the same way as the source?
