Identifying most useful domain-specific tokens for adding to the existing tokenizer

mkarlos · October 12, 2023, 10:00pm

I am using a model for a token classification task in the medical domain. Unfortunately, I don’t have enough data to set up a new tokenizer and train a new model from scratch, so I am fine-tuning an existing BERT-based model. I would, however, like to add some domain-specific words/tokens to boost performance.

My initial thought was to train a new WordPiece tokenizer with a limited vocabulary size on medical-domain text and then add whichever of its tokens are missing from the pre-trained tokenizer. However, I came across this article, which suggests using the spaCy tokenizer and adding only whole words rather than subword tokens, since new subword tokens might interfere with the existing logic of the pre-trained tokenizer.

Any suggestions on which approach might be better?
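
To make the whole-word option concrete, here is a minimal sketch of how candidate words could be ranked. It is an illustration under stated assumptions, not the method from the article: the scoring rule (word frequency times the number of extra pieces the word is split into) is my own heuristic, `toy_vocab` and `toy_corpus` are made-up stand-ins, and `wordpiece_split` is a simplified greedy longest-match splitter standing in for the real pre-trained tokenizer.

```python
import re
from collections import Counter


def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word (simplified WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


def rank_candidate_words(corpus, vocab, min_count=2, top_k=10):
    """Rank out-of-vocabulary words by frequency * extra pieces (a heuristic)."""
    counts = Counter(re.findall(r"[a-z]+", " ".join(corpus).lower()))
    scored = []
    for word, count in counts.items():
        if count < min_count or word in vocab:
            continue
        pieces = wordpiece_split(word, vocab)
        # Assumption: an unknown word is treated as maximally fragmented.
        n_pieces = len(word) if pieces == ["[UNK]"] else len(pieces)
        scored.append((count * (n_pieces - 1), word))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:top_k]]


# Hypothetical toy vocabulary: single letters plus a few longer pieces.
letters = "abcdefghijklmnopqrstuvwxyz"
toy_vocab = set(letters) | {"##" + ch for ch in letters}
toy_vocab |= {"the", "patient", "has", "hyper", "##tension", "##itis"}

# Hypothetical toy corpus standing in for the medical-domain text.
toy_corpus = [
    "The patient has hypertension and myocarditis.",
    "Hypertension worsened; myocarditis was ruled out.",
    "Cardiology noted hypertension.",
]

candidates = rank_candidate_words(toy_corpus, toy_vocab)
print(candidates)
```

With a real Hugging Face tokenizer, `tokenizer.get_vocab()` and `tokenizer.tokenize(word)` could replace the toy vocabulary and splitter, and the selected words could then be added with `tokenizer.add_tokens(candidates)` followed by `model.resize_token_embeddings(len(tokenizer))`. The new embeddings start out untrained, so they are only learned during fine-tuning.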

kumarme072 · February 2, 2024, 5:29pm

I think you are doing the right thing.
