2 tokens for one character in T5

When tokenizing text, the T5 (t5-small) tokenizer appends an eos_token; that's expected.

What's a bit weird is that for the character/string/sentence “0”, it produces three tokens! One of them is a token that detokenizes to an empty string.

Is that an error on my part? Is it a bug? How does that happen? T5's tokenizer is character-based, right? So at the bare minimum, each character should be in the vocabulary.

from transformers import AutoTokenizer

tokenizer_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)
print(tokenizer.encode("0")) # [3, 632, 1]
print(tokenizer.encode("1"))
print(tokenizer.encode("2"))
print(tokenizer.encode("3"))

Outputs:
[3, 632, 1]
[209, 1]
[204, 1]
[220, 1]
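
A quick way to see which of those ids is responsible for the empty string is to decode each id on its own (a small sketch using the same tokenizer object as above; the first id may come back as an empty string or as a lone space):

# Decode each id separately to see what it maps to by itself.
for i in tokenizer.encode("0"):
    print(i, repr(tokenizer.decode([i])))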

I have the exact same question. I can just call something like tokenizer.encode("0", add_special_tokens=False) to get rid of the special tokens such as the eos_token. However, I am also puzzled by the empty-string token that you are seeing.
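
For example (a small, self-contained sketch with the same t5-small tokenizer; dropping special tokens should just remove the trailing eos id 1):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
# Without special tokens, the trailing eos id (1) should be gone,
# leaving only the ids for the text itself.
print(tokenizer.encode("0", add_special_tokens=False))
print(tokenizer.encode("1", add_special_tokens=False))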

The extra token is the SentencePiece underline ('▁') token, a space marker indicating that the token following it is either the start of a word or a standalone token. You can check what the actual tokens are like this:

ids = tokenizer.encode("0")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)

Which shows:

['▁', '0', '</s>']

The '▁' is the id-3 token you’re seeing. When you look at the tokens themselves or the token ids, you can see what’s going on, though when you decode, the marker is turned back into a space (or dropped at the start of the string), which is why it looks like an empty token.

In fact, when you use the T5 tokenizer, all words start with a space like this; most of them just have the space built into the token itself. For example, the token for “1” is actually '▁1' (a single token that contains both the space marker and the character 1).
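
You can verify that the same way (using convert_ids_to_tokens as above; I'd expect something like ['▁1', '</s>']):

ids = tokenizer.encode("1")
# "1" is covered by a single vocabulary entry that already carries the space marker.
print(tokenizer.convert_ids_to_tokens(ids))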

Meanwhile, for words that contain multiple tokens, the first one in the word will have the space while the others won’t. For example:

ids = tokenizer.encode("Onomatopoeia")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)

You get:

['▁On', 'omato', 'p', 'o', 'e', 'i', 'a', '</s>']
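
And decoding those ids should put the pieces back together, with the '▁' turning back into the word boundary (skip_special_tokens just drops the '</s>'; depending on your transformers version there may or may not be a leading space):

ids = tokenizer.encode("Onomatopoeia")
# The subword pieces are joined and '▁' becomes the word boundary again.
print(tokenizer.decode(ids, skip_special_tokens=True))  # -> 'Onomatopoeia'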

Why the tokenizer adds a separate space token in front of “0” but has it merged into the token for “1” is probably just a statistical quirk of how the SentencePiece algorithm behaved when it was run on the corpus Google used to create the T5 tokenizer. There’s no real a priori reason.
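
If you want to confirm that on your own copy of the vocabulary, one way (a sketch; get_vocab() just returns the token-to-id mapping) is to check whether '▁0' and '▁1' exist as single pieces:

vocab = tokenizer.get_vocab()
# '▁1' made it into the vocabulary as one piece; '▁0' apparently did not,
# which is why "0" needs the separate '▁' token in front of it.
print("▁1" in vocab, "▁0" in vocab)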
