The extra token is the SentencePiece underline token ('▁'), a space marker indicating that the following token is either the start of a word or a standalone token. You can check what the actual tokens are like this:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # any T5 checkpoint works
ids = tokenizer.encode("0")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
Which shows:
['▁', '0', '</s>']
The '▁' is that extra third token you're seeing. When you look at the tokens themselves or the token ids, you can see what's going on, though when you decode or print, they're omitted.
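For instance, decoding the ids from above drops the marker (a quick check with the same tokenizer; the exact output strings may vary slightly by transformers version):

print(tokenizer.decode(ids))                            # '0</s>' — no '▁' shown
print(tokenizer.decode(ids, skip_special_tokens=True))  # '0'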
In fact, when you use the T5 tokenizer, all words start with a space like this; it's just that most of them have the space built into the token itself. For example, the token for '1' is actually '▁1' (it's just one token, but it contains the space marker and the character 1).
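You can see the merged token the same way, reusing the tokenizer from above:

ids = tokenizer.encode("1")
print(tokenizer.convert_ids_to_tokens(ids))  # ['▁1', '</s>']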
Meanwhile, for words that contain multiple tokens, the first one in the word will have the space while the others won't. For example:
ids = tokenizer.encode("Onomatopoeia")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
You get:
['▁On', 'omato', 'p', 'o', 'e', 'i', 'a', '</s>']
Why the tokenizer adds a separate space token in front of '0' but merges it into the token for '1' is probably just a statistical quirk of how the SentencePiece algorithm played out on the corpus Google used when creating the T5Tokenizer vocabulary. There's no real a priori reason.
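If you want to confirm this, you can peek at the vocabulary directly. Assuming the t5-small checkpoint from above, the segmentations shown earlier suggest the merged token exists for '1' but not for '0':

vocab = tokenizer.get_vocab()
print('▁1' in vocab)  # expected: True  — merged token exists
print('▁0' in vocab)  # expected: False — so '0' gets a separate '▁'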