Do you need to use the associated tokenizer?

This is a very beginner question.
I got used to using the Transformers library, but I still don’t fully understand some of the mechanics behind it.
I’m doing sequence classification, by the way.
For tokenizers, we have to use the class and model name corresponding to the model we want to use, the same ones we use for the model class.
What happens if you don’t use the correct tokenizer? Can you implement your own tokenization and make it work with one of the existing classification models, or will that be no good?



If you build your own tokenizer (meaning you only build a new vocabulary; the config stays the same),

you need to pre-train the model first.

During pre-training, the model is trained against a specific tokenizer’s vocabulary, so the two must match.



Complementing our colleague’s answer: the tokenizer is responsible for splitting your text into pieces (words or subwords, for instance) and converting those pieces into indices into your model’s embedding matrix. Each piece of text becomes a vector. If you change the tokenizer, those two steps will generally no longer work with your model.
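To make the vocabulary-mismatch problem concrete, here is a toy sketch in pure Python (a whitespace splitter with made-up vocabularies, not a real WordPiece/BPE tokenizer, but the index-lookup step is the same idea):

```python
# Toy "tokenizer": split text on whitespace, then map each piece to an
# index via a vocabulary dictionary. Unknown pieces fall back to [UNK].
def encode(text, vocab):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]

# The vocabulary the model was pre-trained with: row i of the model's
# embedding matrix holds the vector learned for the word with index i.
model_vocab = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

# A different tokenizer assigns different indices to the same words,
# so the model would look up the wrong rows of its embedding matrix.
other_vocab = {"[UNK]": 0, "great": 1, "was": 2, "movie": 3, "the": 4}

text = "the movie was great"
print(encode(text, model_vocab))  # [1, 2, 3, 4]
print(encode(text, other_vocab))  # [4, 3, 2, 1] -- same text, wrong rows
```

The model has no way to notice the swap: the indices are still valid, they just point to embeddings that were trained for different words, so the predictions become meaningless.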

You could implement your own tokenizer, of course, but then you would need to train the model from scratch. Another possibility is to adapt an existing tokenizer: for instance, if you realize BERT doesn’t recognize specific words from your dataset, you can add those words to the tokenizer’s vocabulary. If you are interested, please take a look at this GitHub issue, especially at this comment.
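In Transformers, adding words is typically done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The pure-Python sketch below (all names and values made up) shows conceptually what that resize does: genuinely new words get fresh, randomly initialised rows appended to the embedding matrix (to be learned during fine-tuning), while the pre-trained rows are kept untouched:

```python
import random

def add_tokens(vocab, embeddings, new_words, dim=4):
    """Append unseen words to the vocabulary and grow the embedding
    matrix with one randomly initialised row per new word. This mirrors
    what tokenizer.add_tokens + model.resize_token_embeddings do."""
    for word in new_words:
        if word not in vocab:
            vocab[word] = len(vocab)  # next free index
            embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings

vocab = {"[UNK]": 0, "the": 1, "protein": 2}
embeddings = [[0.0] * 4 for _ in vocab]  # stand-in for pre-trained rows

# "kinase" is new and gets index 3; "the" is already known and is skipped.
vocab, embeddings = add_tokens(vocab, embeddings, ["kinase", "the"])
print(len(vocab), len(embeddings))  # 4 4
```

Note that the new rows carry no pre-trained knowledge, so this only helps if your fine-tuning data contains enough examples of the added words for the model to learn useful vectors for them.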