Difference between tokenizer and tokenizerfast

Hi,

I have searched for an answer to my question, but still can’t find a clear one.

Some issues on GitHub and the forum also report that the results of tokenizer and tokenizerfast are slightly different.

I want to know what the difference between them is (in terms of mechanism).
If they are supposed to output the same result, then why do we need both of them?

hey @ad26kr can you provide a few links on the reported differences between the two types of tokenizers?

cc @anthony who is the tokenizer expert

@anthony
After reading those posts carefully, I found that most of the reported differences between the slow and fast tokenizers have been resolved.

However, I am still curious why they are kept separate, given how much they differ in computation speed. Why not just use the fast one?

oh that’s because we do not have Rust implementations + Python bindings for every type of tokenizer that’s released by the various research groups. by default transformers will look for the fast implementation if it exists, or fall back to the “slow” one when it doesn’t
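
To make the fallback concrete, here is a minimal sketch of the lookup logic described above. It is not the actual transformers code: `REGISTRY`, `SlowTokenizer`, `FastTokenizer`, `load_tokenizer` and the family names are all hypothetical and only illustrate the "use fast if it exists, otherwise slow" rule.

```python
from typing import Dict, List, Optional, Tuple, Type


class SlowTokenizer:
    """Hypothetical stand-in for a pure-Python ("slow") tokenizer."""

    is_fast = False

    def tokenize(self, text: str) -> List[str]:
        # Toy behaviour only: lowercase and split on whitespace.
        return text.lower().split()


class FastTokenizer(SlowTokenizer):
    """Hypothetical stand-in for a Rust-backed ("fast") tokenizer."""

    is_fast = True


# Hypothetical registry: tokenizer family -> (slow class, fast class or None).
# "family_b" has no fast implementation, so it can only fall back to slow.
REGISTRY: Dict[str, Tuple[Type[SlowTokenizer], Optional[Type[SlowTokenizer]]]] = {
    "family_a": (SlowTokenizer, FastTokenizer),
    "family_b": (SlowTokenizer, None),
}


def load_tokenizer(family: str, use_fast: bool = True) -> SlowTokenizer:
    """Return the fast tokenizer when requested and available, else the slow one."""
    slow_cls, fast_cls = REGISTRY[family]
    if use_fast and fast_cls is not None:
        return fast_cls()
    return slow_cls()


tok_a = load_tokenizer("family_a")                       # fast exists, so it is used
tok_b = load_tokenizer("family_b")                       # no fast version, falls back
tok_a_slow = load_tokenizer("family_a", use_fast=False)  # slow explicitly requested
```

In the real library the corresponding knobs are the `use_fast` argument of `AutoTokenizer.from_pretrained` and the `is_fast` attribute on the returned tokenizer, which tells you which implementation you actually got.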
