Tokenizer vs. TokenizerFast

Hi,

When adding a new token to the vocabulary, there is a difference in behavior between the Python tokenizer (BartTokenizer) and the fast tokenizer (BartTokenizerFast).

from transformers import BartTokenizer, BartTokenizerFast

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large') 
tokenizer_fast = BartTokenizerFast.from_pretrained('facebook/bart-large') 

tokenizer.add_tokens("<NEW_TOKEN>")
tokenizer_fast.add_tokens("<NEW_TOKEN>")

sentence = "I added a <NEW_TOKEN> in the vocabulary."

print(tokenizer.encode(sentence)) 
# [0, 100, 355, 10, 50265, 179, 5, 32644, 4, 2]

print(tokenizer_fast.encode(sentence))
# [0, 100, 355, 10, 1437, 50265, 11, 5, 32644, 4, 2]

The fast tokenizer keeps the space before <NEW_TOKEN> as its own token (1437), while the Python tokenizer strips the spaces around the added token, so the following word is encoded without its leading space (179 instead of 11).
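
A quick way to see what is actually produced is to look at the token strings rather than the IDs (in this byte-level BPE vocabulary, Ġ marks a leading space):

print(tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence)))
# [..., 'Ġa', '<NEW_TOKEN>', 'in', ...]   <- both surrounding spaces were stripped

print(tokenizer_fast.convert_ids_to_tokens(tokenizer_fast.encode(sentence)))
# [..., 'Ġa', 'Ġ', '<NEW_TOKEN>', 'Ġin', ...]   <- the space survives as a bare 'Ġ' (1437)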

I tried with RoBERTa and got the same problem.

Thanks!

Technically speaking, the overall implementation of tokenizers with respect to SentencePiece is kind of hacky in Hugging Face.

To control whether or not the space is added with fast tokenizers, you need to wrap it in an AddedToken:

from transformers import AddedToken

tokenizer_fast.add_tokens(AddedToken("<NEW_TOKEN>", lstrip=True))

You can also choose whether or not to keep the space after the token with the rstrip argument.
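
In case it helps, my understanding of the two flags (both default to False in the tokenizers library), illustrated on a hypothetical <OTHER_TOKEN>:

# lstrip=True  -> whitespace to the LEFT of the token is matched as part of it,
#                 so no stray 'Ġ' (1437) is emitted before the token
# rstrip=True  -> whitespace to the RIGHT of the token is matched as part of it,
#                 so the next word loses its leading space ('Ġin' -> 'in')
tokenizer_fast.add_tokens(AddedToken("<OTHER_TOKEN>", lstrip=True, rstrip=True))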

@s4sarath Your remark is completely out of order here, especially since BART uses a byte-level BPE tokenizer, not SentencePiece.

Thanks!

So it works well for the left space. However, keeping the right space with the rstrip argument still doesn't work properly in the Python tokenizer:

token = AddedToken("<NEW_TOKEN>", lstrip=True, rstrip=False)
tokenizer.add_tokens(token)
tokenizer_fast.add_tokens(token)

print(tokenizer.encode(sentence)) 
# [0, 100, 355, 10, 50265, 179, 5, 32644, 4, 2]

print(tokenizer_fast.encode(sentence))
# [0, 100, 355, 10, 50265, 11, 5, 32644, 4, 2]

Yes, the Python tokenizers do not use the AddedToken type. You should use the fast tokenizer whenever one is available, as it has more functionality.
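
A quick sanity check (minimal sketch; in recent versions of transformers, AutoTokenizer returns the fast tokenizer by default):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('facebook/bart-large', use_fast=True)
print(tok.is_fast)  # True -> the AddedToken lstrip/rstrip options will be honored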

@sgugger - I agree. That’s why I specifically mentioned it with respect to SentencePiece.