I am trying to better understand when it makes sense to train a custom tokenizer and language model for your own dataset. My go-to is spaCy and Prodigy, but I realize that setup has limitations. Training a RoBERTa model or something similar with HuggingFace seems like it could offer some advantages: masked language modeling (MLM) on my own text should let the model learn the domain context, which might be more robust than what I would get with spaCy models plus Prodigy active learning. My primary use cases are NER and text classification.
For example, consider extracting product codes that follow a loosely defined convention (DB-TGX-001 or TK-019). You would need to ensure the tokenizer doesn’t split on the hyphens, and you will most likely need to rely on context to tell products apart based on how they are described. spaCy + Prodigy does pretty well, but I am wondering whether training a custom tokenizer + language model with HF would be a better option.
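To make the hyphen concern concrete, here is a rough, stdlib-only sketch of how I would check whether a given tokenizer keeps the codes intact. The regex is only my guess at the convention (uppercase prefix, optional middle blocks, three trailing digits), and `naive_tokenize` is a hypothetical stand-in for whatever tokenizer is being evaluated, not spaCy's or HF's actual behavior.

```python
import re

# Assumed shape of a product code, based only on the two examples
# (DB-TGX-001, TK-019); the real convention is looser than this.
PRODUCT_CODE = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+)*-\d{3}\b")


def find_codes(text):
    """Return every substring that looks like a product code."""
    return PRODUCT_CODE.findall(text)


def broken_codes(text, tokens):
    """Return the codes in `text` that do not survive as a single token."""
    token_set = set(tokens)
    return [code for code in find_codes(text) if code not in token_set]


def naive_tokenize(text):
    """Hypothetical stand-in tokenizer that splits on all punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Replace DB-TGX-001 with TK-019 today"

# A punctuation-splitting tokenizer breaks both codes apart.
print(broken_codes(text, naive_tokenize(text)))

# Plain whitespace splitting keeps them whole.
print(broken_codes(text, text.split()))
```

Swapping in the token strings from a spaCy `Doc` or an HF tokenizer for the second argument would show which codes get split on real data.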
Any suggestions or tips would be greatly appreciated.