I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenize each sentence individually and add up the counts. I have done some debugging and have this small reproducible example to show the issue:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
I get the following output:
['Th', 'ames', 'Ġis', 'Ġa', 'Ġriver']
['We', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']
I would like to understand why the word Thames is split into two tokens when it is at the start of the sequence, whereas it stays a single token when it is not at the start. I have noticed this behaviour is very frequent and, assuming it is not a bug, I would like to understand why the BART tokenizer behaves like this.
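In case it is useful, here is a small follow-up check I sketched. I am assuming (not certain) that the difference is tied to the leading space, which byte-level BPE encodes as the 'Ġ' prefix, and that the add_prefix_space option applies to this tokenizer; please correct me if either assumption is wrong.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

# A word with an explicit leading space seems to map to the 'Ġ'-prefixed token,
# while the same word without it gets split (based on my observations above)
print(tokenizer.tokenize(" Thames"))  # I expect something like ['ĠThames']
print(tokenizer.tokenize("Thames"))   # I expect something like ['Th', 'ames']

# If add_prefix_space=True makes the tokenizer treat the first word as if it
# were preceded by a space, the sentence-initial split should go away
tokenizer_ps = AutoTokenizer.from_pretrained('facebook/bart-large-cnn', add_prefix_space=True)
print(tokenizer_ps.tokenize("Thames is a river"))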