Tokenizer vs. TokenizerFast

Hi,

When adding a new token to the vocabulary, there is a difference in behavior between the Python tokenizer (BartTokenizer) and the fast tokenizer (BartTokenizerFast).

from transformers import BartTokenizer, BartTokenizerFast

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large') 
tokenizer_fast = BartTokenizerFast.from_pretrained('facebook/bart-large') 

tokenizer.add_tokens("<NEW_TOKEN>")
tokenizer_fast.add_tokens("<NEW_TOKEN>")

sentence = "I added a <NEW_TOKEN> in the vocabulary."

print(tokenizer.encode(sentence)) 
# [0, 100, 355, 10, 50265, 179, 5, 32644, 4, 2]

print(tokenizer_fast.encode(sentence))
# [0, 100, 355, 10, 1437, 50265, 11, 5, 32644, 4, 2]

The fast tokenizer keeps the space before <NEW_TOKEN> as its own token (1437), while the Python tokenizer strips the spaces around the added token, so the following word is encoded without its leading space (179 instead of 11).
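
A quick way to see what is actually produced is to look at the token strings rather than the IDs (in this byte-level BPE vocabulary, Ġ marks a leading space):

print(tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence)))
# [..., 'Ġa', '<NEW_TOKEN>', 'in', ...]   <- both surrounding spaces were stripped

print(tokenizer_fast.convert_ids_to_tokens(tokenizer_fast.encode(sentence)))
# [..., 'Ġa', 'Ġ', '<NEW_TOKEN>', 'Ġin', ...]   <- the space survives as a bare 'Ġ' (1437)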

I tried with RoBERTa and got the same problem.

Thanks!

Technically speaking, the overall implementation of tokenizers with respect to SentencePiece is kind of hacky in Hugging Face.

To control whether or not the space is added with fast tokenizers, you need to wrap it in an AddedToken:

from transformers import AddedToken

tokenizer_fast.add_tokens(AddedToken("<NEW_TOKEN>", lstrip=True))

You can also choose whether or not to keep the space after the token with the rstrip argument.
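
In case it helps, my understanding of the two flags (both default to False in the tokenizers library), illustrated on a hypothetical <OTHER_TOKEN>:

# lstrip=True  -> whitespace to the LEFT of the token is matched as part of it,
#                 so no stray 'Ġ' (1437) is emitted before the token
# rstrip=True  -> whitespace to the RIGHT of the token is matched as part of it,
#                 so the next word loses its leading space ('Ġin' -> 'in')
tokenizer_fast.add_tokens(AddedToken("<OTHER_TOKEN>", lstrip=True, rstrip=True))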

@s4sarath Your remark is completely out of order here, especially since BART uses a byte-level BPE tokenizer, not SentencePiece.

Thanks!

So it works well for the left space. However, keeping the right space with the rstrip argument still doesn't work properly in the Python tokenizer:

token = AddedToken("<NEW_TOKEN>", lstrip=True, rstrip=False)
tokenizer.add_tokens(token)
tokenizer_fast.add_tokens(token)

print(tokenizer.encode(sentence)) 
# [0, 100, 355, 10, 50265, 179, 5, 32644, 4, 2]

print(tokenizer_fast.encode(sentence))
# [0, 100, 355, 10, 50265, 11, 5, 32644, 4, 2]

Yes, the Python tokenizers do not use the AddedToken type. You should use the fast tokenizer whenever one is available, as it has more functionality.
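
A quick sanity check (minimal sketch; in recent versions of transformers, AutoTokenizer returns the fast tokenizer by default):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('facebook/bart-large', use_fast=True)
print(tok.is_fast)  # True -> the AddedToken lstrip/rstrip options will be honored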

@sgugger - I agree. That’s why I specifically mentioned it with respect to SentencePiece.