The extra token is the SentencePiece underline token ('▁'), a space marker indicating that the following token is either the start of a word or a standalone token. You can check what the actual tokens are like this:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # any T5 checkpoint works
ids = tokenizer.encode("0")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
Which shows:
['▁', '0', '</s>']
The '▁' is that extra third token you're seeing. When you look at the tokens themselves or the token ids, you can see what's going on, though when you decode or print, they're omitted.
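For instance, decoding the ids from above drops the marker (a quick check with the same tokenizer; the exact output strings may vary slightly by transformers version):

print(tokenizer.decode(ids))                            # '0</s>' — no '▁' shown
print(tokenizer.decode(ids, skip_special_tokens=True))  # '0'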
In fact, when you use the T5 tokenizer, all words start with a space like this; it's just that most of them have the space built into the token itself. For example, the token for '1' is actually '▁1' (it's just one token, but it contains the space marker and the character 1).
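You can see the merged token the same way, reusing the tokenizer from above:

ids = tokenizer.encode("1")
print(tokenizer.convert_ids_to_tokens(ids))  # ['▁1', '</s>']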
Meanwhile, for words that contain multiple tokens, the first one in the word will have the space while the others won't. For example:
ids = tokenizer.encode("Onomatopoeia")
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
You get:
['▁On', 'omato', 'p', 'o', 'e', 'i', 'a', '</s>']
Why the tokenizer adds a separate space token in front of '0' but merges it into the token for '1' is probably just a statistical quirk of how the SentencePiece algorithm played out on the corpus Google used when creating the T5Tokenizer vocabulary. There's no real a priori reason.
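If you want to confirm this, you can peek at the vocabulary directly. Assuming the t5-small checkpoint from above, the segmentations shown earlier suggest the merged token exists for '1' but not for '0':

vocab = tokenizer.get_vocab()
print('▁1' in vocab)  # expected: True  — merged token exists
print('▁0' in vocab)  # expected: False — so '0' gets a separate '▁'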