I found some information but couldn't find a definitive statement.
However, I did find an example where the tokenization results differ:
from transformers import LlamaTokenizer, LlamaTokenizerFast

model_name_or_path = "ziqingyang/chinese-alpaca-2-7b"
sentence = "åäŗå</s>"

# Load both the slow (sentencepiece-based) and the fast (Rust-based) tokenizer
llama_t = LlamaTokenizer.from_pretrained(model_name_or_path)
llama_t_fast = LlamaTokenizerFast.from_pretrained(model_name_or_path)

print(llama_t(sentence))
print(llama_t_fast(sentence))
The output is:

{'input_ids': [1, 29871, 35302, 32013, 2], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [1, 29871, 32050, 34353, 2], 'attention_mask': [1, 1, 1, 1, 1]}
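To see exactly where the two tokenizations diverge, a small helper (my own sketch, not part of transformers) can compare the two input_ids lists position by position:

```python
def first_divergence(ids_a, ids_b):
    """Return the index of the first position where the two id lists differ,
    or None if they are identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    # One list may be a strict prefix of the other
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None

slow_ids = [1, 29871, 35302, 32013, 2]  # from LlamaTokenizer
fast_ids = [1, 29871, 32050, 34353, 2]  # from LlamaTokenizerFast
print(first_divergence(slow_ids, fast_ids))  # → 2
```

Here both tokenizers agree on the BOS token and the leading-space token 29871, then diverge from index 2 onward, i.e. on the actual text content.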
And here is my environment:

transformers 4.31.0
tokenizers 0.13.3