Do you have to use a model card's accompanying tokenizer?

Hi. I’m currently using BERT-like models (e.g., bert-base-cased, bert-base-multilingual-cased) for a project at work. The data I’m working with produces a lot of [UNK] tokens with the readily available tokenizers, so I wanted to create my own tokenizer. However, I’m wondering whether I can just build a new tokenizer with a new vocabulary and use it with one of the standard pretrained models.
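
For reference, this is roughly how I’m checking the [UNK] rate (the checkpoint and sample text are just placeholders, not my actual data):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and text -- swap in your own model and data.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

text = "an example sentence from my dataset"
tokens = tokenizer.tokenize(text)

# Fraction of tokens the tokenizer could not represent.
unk_rate = tokens.count(tokenizer.unk_token) / max(len(tokens), 1)
print(f"[UNK] rate: {unk_rate:.2%}")
```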

My thinking is that it won’t work properly, since the available models were pretrained with their own tokenizers. Still, I’m curious whether this approach would be viable. Thanks.

Yeah, it won’t work well. With a new tokenizer you’d have to retrain the embedding layer at the very least (its size could be different as well), and I’d say that’s a pretty radical solution. These tokenizers also include individual characters as tokens, so there really shouldn’t be that many [UNK]s unless your data contains characters that weren’t present in the pretraining data. I think it should be possible to extend the tokenizer’s vocabulary with additional tokens if you know what they are (see the sketch below). You would still have to train the embeddings for them, though, so it probably won’t work well right off the bat.
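
Something along these lines should work with the transformers API; the token strings here are just placeholders for whatever actually shows up as [UNK] in your data:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Placeholder tokens -- replace with the strings your data actually needs.
num_added = tokenizer.add_tokens(["domain_term_a", "domain_term_b"])
print(f"Added {num_added} tokens")

# Grow the embedding matrix so the new tokens get rows. These rows are
# randomly initialized, so they still need to be trained on your data.
model.resize_token_embeddings(len(tokenizer))
```

The rest of the model stays as it was, so after this you’d fine-tune on your own data to give the new embeddings something meaningful.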