Can I use tokenizer X with model Y?

surya-narayanan · April 20, 2023, 4:51am

It seems that common practice is to initialize a tokenizer from a model's checkpoint:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model1", model_max_length=512)

But can I train another model (model2) using a different model's (model1's) tokenizer?

I want to benchmark the performance of different models on a dataset, but it seems the dataset must be tokenized in a way that’s specific to a model.

CelineLind · April 20, 2023, 6:04am

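From my experience and understanding, in most cases different models use different tokenisers. The tokeniser splits the input and converts it into the format the model expects, typically a sequence of token IDs drawn from that model's vocabulary. If you use a different tokeniser from the model's own, the data is no longer constructed the way the model expects. It may throw an error (for example, when a token ID falls outside the model's embedding table), or it may run without error and give poor results because the IDs no longer refer to the tokens the model was trained on.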
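To make the mismatch concrete, here is a minimal sketch using two toy word-level vocabularies. They are made up for illustration and stand in for the tokenisers of two different models; real tokenisers use much larger subword vocabularies, but the ID-mapping problem is the same.

```python
# Two hypothetical vocabularies, standing in for the tokenisers of two models.
VOCAB_A = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
VOCAB_B = {"[UNK]": 0, "sat": 1, "the": 2, "dog": 3, "cat": 4}


def encode(text, vocab):
    """Split on whitespace and map each word to its ID in `vocab`."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to words; raises KeyError for IDs outside the vocabulary."""
    inverse = {index: word for word, index in vocab.items()}
    return " ".join(inverse[i] for i in ids)


ids_a = encode("the cat sat", VOCAB_A)
ids_b = encode("the cat sat", VOCAB_B)

# What a model trained with vocabulary B would effectively "read"
# if it were fed the IDs produced by tokeniser A.
misread = decode(ids_a, VOCAB_B)

print(ids_a)    # IDs from tokeniser A
print(ids_b)    # IDs from tokeniser B for the same sentence
print(misread)  # the same IDs interpreted with the wrong vocabulary
```

The same sentence gets different IDs under each vocabulary, so feeding tokeniser A's output to a model trained with tokeniser B silently changes what the model sees.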

In the papers I've read, when the performance of different models is evaluated, each model's tokeniser is included in the evaluation. Basically, the tokeniser is treated as part of the model. So you can compare the performance of two models on the same dataset, each using its own tokeniser - that is standard ML practice.
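A rough sketch of that kind of comparison with the transformers library follows. It assumes a sequence-classification task and PyTorch, and the checkpoint names passed to `benchmark` (for example the "model1" and "model2" placeholders from the question) are not real checkpoints. The helper names `load_pair`, `predict`, `accuracy` and `benchmark` are made up for this example.

```python
def load_pair(checkpoint):
    """Load a model and its own tokeniser from the same checkpoint name."""
    # Imported lazily so the scoring helper below works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    return tokenizer, model


def predict(tokenizer, model, texts):
    """Tokenise raw texts with the model's own tokeniser and return predicted class IDs."""
    import torch

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).tolist()


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def benchmark(checkpoints, texts, labels):
    """Score each checkpoint on the same raw texts, each with its own tokeniser."""
    results = {}
    for name in checkpoints:
        tokenizer, model = load_pair(name)
        results[name] = accuracy(predict(tokenizer, model, texts), labels)
    return results
```

The point of the design is that the shared dataset stays as raw text. Tokenisation happens inside the loop, per model, so every model is scored on exactly the same examples without any model being fed another model's token IDs.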

This stack overflow question might be of further help: https://stackoverflow.com/questions/72625528/translation-between-different-tokenizers
