Newbie: Main difference between tokenizers?

Hi, HuggingFace community.

I’m new to NLP and have learned that there are many packages that can do tokenization, such as spaCy, NLTK, torchtext, and of course the Hugging Face tokenizers library. My questions are:

:one: What is the main difference between them?

It seems to me that :hugs: tokenizers are fast and are paired with their corresponding models. Are there any differences other than that?
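
For context, here's a minimal sketch (using the `transformers` `AutoTokenizer` API, if I understand it correctly) of what I mean by a tokenizer "matching" its model:

```python
from transformers import AutoTokenizer

# Load the tokenizer that was used to pretrain bert-base-uncased,
# so the vocabulary and special tokens line up with the model's embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("Tokenizers split text into model-specific pieces.")
print(encoding["input_ids"])  # integer ids the model was trained on
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```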

:two: Why do different models need different tokenizers?

I've heard that using the wrong tokenizer may hurt performance, so the choice of tokenizer clearly matters. Why is that the case? Any intuition?
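
To make the question concrete, here's a small sketch comparing two tokenizers on the same text. My (possibly naive) assumption is that this mismatch in pieces and ids is what hurts performance:

```python
from transformers import AutoTokenizer

text = "Tokenization schemes rarely agree."

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

# Different vocabularies and algorithms give different pieces and ids;
# ids produced by one tokenizer would index unrelated rows of the other
# model's embedding matrix, which presumably is why mismatches hurt.
print(bert_tok.tokenize(text))
print(gpt2_tok.tokenize(text))
print(bert_tok.encode(text))
print(gpt2_tok.encode(text))
```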

:three: Which should I try first?

If I want to design my own model (a toy one) instead of directly using a provided API, which tokenizer should I try first?
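
For example, I'm imagining something like this minimal sketch, which trains a tiny BPE tokenizer with the :hugs: `tokenizers` library on an in-memory toy corpus (the vocab size and special tokens here are just placeholders I made up):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a tiny BPE tokenizer from scratch on a toy corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
corpus = ["a toy corpus of a few sentences", "enough to learn a few merges"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("a few toy sentences").tokens)
```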
