BertTokenizerFast ignoring `never_split` argument


Since Transformers will start using the fast tokenizers by default in v4, I want to report a difference in behavior between the “slow” tokenizers and the fast tokenizers.

The `never_split` argument passed to the fast tokenizer class has no effect on the tokenization output, as you can see below:

from transformers import BertTokenizerFast, BertTokenizer

tokFast = BertTokenizerFast.from_pretrained("bert-base-cased", never_split=["lol+"])
print(tokFast.tokenize("Hey lol+"))
#>>> ['Hey', 'lo', '##l', '+']
tokSlow = BertTokenizer.from_pretrained("bert-base-cased", never_split=["lol+"])
print(tokSlow.tokenize("Hey lol+"))
#>>> ['Hey', 'lol+']
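For context on what the slow path is doing, here is a minimal, simplified sketch of how a basic tokenizer can honor a `never_split` list (this is my own illustration, not the actual transformers implementation): tokens on the list pass through whole, everything else gets split on punctuation.

```python
import string

def split_on_punc(token):
    # Break a token into runs of punctuation and non-punctuation characters.
    pieces, current = [], ""
    for ch in token:
        if ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces

def basic_tokenize(text, never_split=()):
    # Whitespace-split first, then punctuation-split every token
    # *except* those listed in never_split, which are kept intact.
    out = []
    for token in text.split():
        if token in never_split:
            out.append(token)
        else:
            out.extend(split_on_punc(token))
    return out

print(basic_tokenize("Hey lol+"))
# ['Hey', 'lol', '+']
print(basic_tokenize("Hey lol+", never_split=["lol+"]))
# ['Hey', 'lol+']
```

The fast tokenizers don't run this Python pre-tokenization step at all (their pre-tokenizer is compiled into the Rust backend), which is presumably why the argument is silently ignored.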

Note: This issue breaks one of our models, aubmindlab/bert-base-arabert. Our pre-segmentation step appends the “+” sign to Arabic prefixes and suffixes, and when the pre-segmented text is passed through the tokenizer, a space is inserted between the “+” sign and the adjacent letters. Hence we provide the tokenizer with a list of tokens that shouldn’t be split.

Note 2: I’m not sure if I should post this as an issue in the GitHub repo instead.