Do you need to use the associated tokenizer?

Hi,
This is a very beginner question: I'm used to the Transformers library, but I still don't fully understand some of the mechanics behind it. I'm doing sequence classification, by the way.
For the tokenizer, we have to use the class and checkpoint name that match the model we want to use, i.e. the same name we pass to the model class.
What happens if you don't use the matching tokenizer? Can you implement your own tokenization and make it work with one of the existing classification models, or will that not work?
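For reference, this is the matched setup I mean (a minimal sketch; bert-base-uncased is just an example checkpoint):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # example checkpoint, not a recommendation

# Tokenizer and model are loaded from the same checkpoint, so the token ids the
# tokenizer produces line up with the rows of the model's embedding matrix.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)  # one raw score per label
```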


Hi,

If you build your own tokenizer (meaning a new vocabulary, with the same configuration), you need to pre-train the model first.

During pre-training, the model learns against its tokenizer's vocabulary dictionary, so its embeddings only match that vocabulary.
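For example, Transformers can train a new tokenizer with the same configuration from your own corpus (a rough sketch; the corpus and vocab_size are just placeholders):

```python
from transformers import AutoTokenizer

# A tiny in-memory corpus, purely for illustration.
corpus = [
    "domain specific text with unusual tokens",
    "more sentences from the target domain",
]

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Same tokenization algorithm and config, but a brand-new vocabulary
# learned from the corpus above.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=1000)

# The same word can now map to different ids, so a model pre-trained with the
# old vocabulary would look up the wrong embeddings.
print(old_tokenizer("unusual tokens")["input_ids"])
print(new_tokenizer("unusual tokens")["input_ids"])
```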

Regards.

Hi,

Complementing our colleague's answer: the tokenizer is responsible for splitting your text into pieces (subwords, for instance) and converting those pieces into indices into your model's embedding matrix, so that each piece of text becomes a vector. If you change the tokenizer, those two steps will in general no longer line up with your model.
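As a rough illustration of those two steps (using bert-base-uncased as an example):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Step 1: split the text into pieces and map each piece to an index.
encoding = tokenizer("Tokenizers split text into pieces")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["input_ids"])

# Step 2: each index selects one row (one vector) of the model's embedding matrix.
embedding_matrix = model.get_input_embeddings().weight
print(embedding_matrix.shape)  # (vocab_size, hidden_size), e.g. (30522, 768)
```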

You could implement your own tokenizer, of course, but then you would need to train the model from scratch. Another possibility is to adapt an existing tokenizer: for instance, if you realize BERT doesn't recognize specific words from your dataset, you can add those words to the tokenizer's vocabulary (see the sketch below). If you are interested, take a look at this GitHub issue, especially this comment.
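A minimal sketch of that second option (the new words here are just placeholders):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Add domain-specific words that the original vocabulary would split into many pieces.
num_added = tokenizer.add_tokens(["electroencephalography", "mycustomterm"])

# The embedding matrix must grow so each new token gets its own (randomly
# initialised) vector; fine-tuning then learns useful values for them.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens, new vocab size: {len(tokenizer)}")
```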
