I found some information but couldn't find a definitive statement.
However, I did find an example where the tokenization results differ:
from transformers import LlamaTokenizer, LlamaTokenizerFast

model_name_or_path = "ziqingyang/chinese-alpaca-2-7b"
sentence = "åäŗå</s>"

# Load both the slow (sentencepiece-based) and the fast (Rust-based) tokenizer
llama_t = LlamaTokenizer.from_pretrained(model_name_or_path)
llama_t_fast = LlamaTokenizerFast.from_pretrained(model_name_or_path)

print(llama_t(sentence))
print(llama_t_fast(sentence))
The output is:

{'input_ids': [1, 29871, 35302, 32013, 2], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [1, 29871, 32050, 34353, 2], 'attention_mask': [1, 1, 1, 1, 1]}
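To see exactly where the two tokenizations diverge, a small helper (my own sketch, not part of transformers) can compare the two input_ids lists position by position:

```python
def first_divergence(ids_a, ids_b):
    """Return the index of the first position where the two id lists differ,
    or None if they are identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    # One list may be a strict prefix of the other
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))
    return None

slow_ids = [1, 29871, 35302, 32013, 2]  # from LlamaTokenizer
fast_ids = [1, 29871, 32050, 34353, 2]  # from LlamaTokenizerFast
print(first_divergence(slow_ids, fast_ids))  # → 2
```

Here both tokenizers agree on the BOS token and the leading-space token 29871, then diverge from index 2 onward, i.e. on the actual text content.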
And here is my environment:

transformers 4.31.0
tokenizers 0.13.3