Hi,
Since Transformers will start using the fast tokenizers by default in v4, I want to report a difference between the "slow" tokenizers and the fast tokenizers.
The never_split argument passed to the fast tokenizer class has no effect on the tokenization output, as you can see below:
from transformers import BertTokenizerFast, BertTokenizer

# Fast tokenizer: never_split is silently ignored
tokFast = BertTokenizerFast.from_pretrained("bert-base-cased",
                                            do_basic_tokenize=True,
                                            never_split=["lol+"],
                                            do_lower_case=False)
print(tokFast.tokenize("Hey lol+"))
#>>> ['Hey', 'lo', '##l', '+']

# Slow tokenizer: never_split keeps "lol+" intact
tokSlow = BertTokenizer.from_pretrained("bert-base-cased",
                                        do_basic_tokenize=True,
                                        never_split=["lol+"],
                                        do_lower_case=False)
print(tokSlow.tokenize("Hey lol+"))
#>>> ['Hey', 'lol+']
Note: This issue is breaking one of our models, aubminlab/bert-base-arabert, since we provide the tokenizer with a list of tokens that shouldn't be split: our tokenized data uses the "+" sign appended to Arabic prefixes and suffixes during pre-segmentation, and when the pre-segmented text is passed through the tokenizer, a space is inserted between the "+" sign and the other letters. Hence we provide the model with a list of the tokens that shouldn't be split.
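For illustration, here is a minimal sketch of that use case with the slow tokenizer (using bert-base-cased for simplicity; the pre-segmented string and the never_split entries below are made-up placeholders, not the actual AraBERT segment markers):

from transformers import BertTokenizer

# Hypothetical pre-segmented input: "+" marks Arabic prefixes/suffixes
# (placeholder tokens, not real AraBERT pre-segmentation output)
presegmented = "w+ kitab +hum"

tok = BertTokenizer.from_pretrained("bert-base-cased",
                                    do_basic_tokenize=True,
                                    never_split=["w+", "+hum"],
                                    do_lower_case=False)

# The slow tokenizer keeps "w+" and "+hum" as single tokens (they bypass
# WordPiece), while the fast tokenizer would split them around the "+" sign
# and break the pre-segmentation.
print(tok.tokenize(presegmented))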
Note 2: I'm not sure whether I should post this as an issue on the GitHub repo.