When should you train a custom tokenizer/language model?

mgiardinelli · October 9, 2021, 12:40pm

I am trying to better understand when it makes sense to train a custom tokenizer and language model for your dataset. My go-to is spaCy and Prodigy, but I realize there are limitations. Training a RoBERTa model or something similar with Hugging Face seems like it could offer advantages over spaCy models plus Prodigy active learning, since masked language modeling (MLM) pretraining lets the model learn the domain context. My primary use cases are NER and text classification.

For example, consider extracting product codes that follow a loosely defined convention (DB-TGX-001 or TK-019). You would need to ensure the tokenizer doesn’t split on the hyphens, and you would most likely need to rely on context to differentiate between products based on how they are described. spaCy + Prodigy does pretty well, but I am wondering whether training a custom tokenizer + language model with HF would be a better option.
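
To make the hyphen concern concrete, here is a minimal sketch using only the Python standard library. The regex for product codes is an assumption inferred from the two examples above (two or more uppercase letters followed by hyphen-separated alphanumeric groups), and `tokenize` is a hypothetical helper, not a spaCy or HF API.

```python
import re

# Assumed product-code shape, inferred only from "DB-TGX-001" and "TK-019":
# 2+ uppercase letters, then one or more "-GROUP" parts of uppercase letters/digits.
PRODUCT_CODE = r"[A-Z]{2,}(?:-[A-Z0-9]+)+"

# Try the product-code pattern first, then plain words, then single punctuation marks.
TOKEN_PATTERN = re.compile(rf"{PRODUCT_CODE}|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping product codes as single tokens."""
    return TOKEN_PATTERN.findall(text)


def naive_tokenize(text: str) -> list[str]:
    """Baseline for comparison: always splits on hyphens and other punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    sentence = "Ship DB-TGX-001 and TK-019 today."
    print(tokenize(sentence))
    print(naive_tokenize(sentence))
```

The baseline breaks each code into fragments such as `DB`, `-`, `TGX`, while the rule-based version keeps each code whole. This only covers the tokenization half of the problem; telling products apart from their surrounding descriptions still depends on the model.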

Any suggestions or tips would be greatly appreciated.
