BART tokenizer tokenizes the same word differently?

I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenize each sentence individually and add up the counts. After some debugging, I have this small reproducible example that shows the issue:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))

I get the following output:

['Th', 'ames', 'Ġis', 'Ġa', 'Ġriver']
['We', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']

I would like to understand why the word Thames is split into two tokens when it is at the start of the sequence, whereas it is a single token when it is not at the start. I have noticed this behaviour is very frequent and, assuming it is not a bug, I would like to understand why the BART tokenizer behaves like this.

Many thanks

For those interested, the reply to this question has now been provided by a StackOverflow user here
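
For anyone landing here without the link, below is a rough sketch of the effect as I understand it. BART uses a byte-level BPE vocabulary in which a preceding space is part of the token and is displayed as 'Ġ', so 'ĠThames' (with a space before it) and 'Thames' (no space, e.g. at the very start of the text) are different strings as far as the vocabulary is concerned. The code is only a toy: TOY_VOCAB and toy_tokenize are hypothetical names, the vocabulary contains just the pieces seen in the output above, and it uses greedy longest-match lookup rather than the real BPE merge procedure.

```python
# Hypothetical mini-vocabulary built from the tokens in the output above.
# Note that "ĠThames" is present, but "Thames" without the space marker is not.
TOY_VOCAB = {
    "Th", "ames", "ĠThames", "Ġis", "Ġa", "Ġriver",
    "We", "Ġare", "Ġin", "ĠLondon", ".",
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Toy greedy longest-match tokenizer (NOT real BPE, illustration only)."""
    # Represent each space as the marker "Ġ" attached to the following word.
    marked = text.replace(" ", "Ġ")
    tokens = []
    i = 0
    while i < len(marked):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                tokens.append(marked[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


print(toy_tokenize("Thames is a river"))
print(toy_tokenize("We are in London. Thames is a river"))
```

With this toy vocabulary the first call splits Thames into 'Th' and 'ames', because there is no space before it, while the second call finds 'ĠThames' as a single entry. In the real library, the BART tokenizer should accept an add_prefix_space=True argument in from_pretrained, which treats the first word as if it had a leading space; I would expect that to make per-sentence and full-text token counts line up more closely, but I have not verified it for every case.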