It seems that common practice is to initialize a tokenizer by a model:
tokenizer = AutoTokenizer.from_pretrained("model1", model_max_length=512)
but can I train other models (model2) using a different model's (model1's) tokenizer?
I want to benchmark the performance of different models on a dataset, but it seems the dataset must be tokenized in a way that’s specific to a model.
From my experience and understanding, in most cases different models use different tokenisers. The tokeniser splits up and formats the input into the format the model expects the data to be in. If you try to use a tokeniser other than the model's own, it may throw an error, because the data is no longer constructed the way the model expects.
In the papers I've read, when they evaluate the performance of different models, the models' tokenisers are included as part of this evaluation. Basically, each tokeniser is treated as part of its model. So you can compare the performance of two models on the same dataset, each using its own specific tokeniser - that is standard ML practice.
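Here's a minimal sketch of what that looks like in practice: the same raw dataset is tokenised separately for each model, always with that model's own tokeniser. The model names and the dataset used here are just placeholder assumptions, not something from your setup.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Assumed example dataset; swap in whatever dataset you're benchmarking on.
dataset = load_dataset("imdb", split="test[:1000]")

# Assumed example models; each one is paired with the tokeniser it was pretrained with.
for model_name in ["bert-base-uncased", "roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenise the same raw text again for this model - don't reuse another model's encodings.
    encoded = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True,
    )

    # ... run your evaluation loop on `encoded` with `model` and record the metric ...
```

The key point is that tokenisation happens inside the per-model loop, so each model sees inputs built by its own tokeniser while the underlying raw dataset stays identical across models.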
This stack overflow question might be of further help: https://stackoverflow.com/questions/72625528/translation-between-different-tokenizers