Difference between Tokenizer and TokenizerFast

Hi,

I have searched for an answer to my question, but still can't find a clear one.

Some issues on GitHub and this forum also report that the results of the tokenizer and the fast tokenizer differ slightly.

I want to know: what is the difference between them in terms of mechanism?
If they are supposed to produce the same output, why do we need both of them?

hey @ad26kr can you provide a few links on the reported differences between the two types of tokenizers?

cc @anthony who is the tokenizer expert

@anthony
After carefully reading those posts, I found that most of the reported differences between tokenizers and fast tokenizers have been resolved.

However, I am still curious why the two are kept separate, since they differ greatly in computation speed. Why not just use the fast one?

oh that’s because we do not have Rust implementations + Python bindings for every type of tokenizer released by the various research groups. by default transformers will look for the fast implementation if it exists, or fall back to the “slow” one when it doesn’t
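To illustrate that fallback behavior, here is a minimal sketch using `AutoTokenizer` (assuming the `bert-base-uncased` checkpoint as an example, since it ships both implementations):

```python
from transformers import AutoTokenizer

# AutoTokenizer returns the fast (Rust-backed) implementation when one exists;
# use_fast=False forces the pure-Python ("slow") implementation.
fast = AutoTokenizer.from_pretrained("bert-base-uncased")
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

print(fast.is_fast)   # True
print(slow.is_fast)   # False

# For checkpoints that ship both, the token ids should agree:
text = "Why do we need both tokenizers?"
print(fast(text)["input_ids"] == slow(text)["input_ids"])
```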


Hi @lewtun
I was just wondering whether the only difference between fast tokenizers and Python tokenizers is really just speed.
The reason I ask is that, for NLLB for example, the Python tokenizer is based on SentencePiece while the fast tokenizer is based on BPE. So I was wondering whether the two tokenizers are designed to produce the same output despite being based on different algorithms.
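Not an authoritative answer, but one way to check what actually backs a fast tokenizer is to inspect its `backend_tokenizer` from the `tokenizers` library. A sketch, assuming the `facebook/nllb-200-distilled-600M` checkpoint as an example:

```python
from transformers import AutoTokenizer

# A fast tokenizer wraps a tokenizers-library Tokenizer object; its `.model`
# attribute reveals which algorithm (BPE, Unigram, WordPiece, WordLevel) the
# original SentencePiece model was converted to.
tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
print(type(tok.backend_tokenizer.model).__name__)
```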