I found some information but couldn't find a definitive statement.
However, I did find an example where LlamaTokenizer and LlamaTokenizerFast produce different tokenization results:
```python
from transformers import LlamaTokenizer, LlamaTokenizerFast

model_name_or_path = "ziqingyang/chinese-alpaca-2-7b"
sentence = "åäºå</s>"  # the input text ends with the literal </s> EOS token

# Load both the slow (SentencePiece-based) and the fast (Rust-based) tokenizer
llama_t = LlamaTokenizer.from_pretrained(model_name_or_path)
llama_t_fast = LlamaTokenizerFast.from_pretrained(model_name_or_path)

# Encode the same sentence with each tokenizer and compare the results
print(llama_t(sentence))
print(llama_t_fast(sentence))
```
The output is:

```
{'input_ids': [1, 29871, 35302, 32013, 2], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [1, 29871, 32050, 34353, 2], 'attention_mask': [1, 1, 1, 1, 1]}
```
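To see exactly where the two tokenizers diverge, one way is to map the differing IDs back to token strings. This is only a diagnostic sketch, assuming the same tokenizers and `sentence` defined above:

```python
# Diagnostic sketch: convert each ID sequence back to token strings
# so the differing pieces can be inspected side by side.
slow_ids = llama_t(sentence)["input_ids"]
fast_ids = llama_t_fast(sentence)["input_ids"]

print(llama_t.convert_ids_to_tokens(slow_ids))
print(llama_t_fast.convert_ids_to_tokens(fast_ids))
```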
And here is my environment:
transformers 4.31.0
tokenizers 0.13.3