NER for short technical phrases

david-waterworth · December 16, 2020, 8:54am

I’m researching how to train a transformer model to identify labels from IoT devices. I have 1M training examples, each relatively short (< 128 chars, usually shorter). The text is English, but it’s full of abbreviations and generally doesn’t contain spaces between the “words”. I’ve trained a BPE tokeniser with custom pretokenisation, which works very well, and now I’m trying to decide which model to use.
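To make the pretokenisation step concrete, here is a minimal sketch of how unspaced labels could be split into word-like pieces before BPE training. The post does not describe the actual rules, so the regex, the function name `pre_tokenize`, and the sample labels are all assumptions for illustration only.

```python
import re
from typing import List

# Hypothetical rule set (not the author's actual rules):
#   1. an all-caps run that is not followed by a lowercase letter (e.g. "AHU")
#   2. an optionally capitalised lowercase run (e.g. "Sup", "temp")
#   3. a run of digits
# Separators such as "_", "-" and "." match none of these and are dropped.
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def pre_tokenize(label: str) -> List[str]:
    """Split an unspaced device label into word-like pieces."""
    return _PIECE.findall(label)


if __name__ == "__main__":
    # Made-up labels in the style of building-automation point names.
    print(pre_tokenize("AHU1_SupAirTemp"))  # ['AHU', '1', 'Sup', 'Air', 'Temp']
    print(pre_tokenize("zn-temp.sp2"))      # ['zn', 'temp', 'sp', '2']
```

The pieces returned here would then be fed to the BPE trainer, so merges never cross a boundary such as `AHU` / `1`.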

RoBERTa seems like a good option since it doesn’t use next-sentence prediction (NSP). NSP makes no sense for my data, as my sentences are more or less independent (they do come in groups though: one piece of equipment will have a number of sensors).

I’m looking for opinions as to which model to use and what sizes. I set my tokeniser’s vocab size to 3000 - that decision was slightly arbitrary, but I based it on the fact that my vocabulary is much smaller than that of, say, a full English corpus.
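As a rough aid for the sizing question, the sketch below estimates the parameter count of a BERT/RoBERTa-style encoder from its main hyperparameters, so the effect of a 3000-token vocab is easy to see. The hidden size, layer count, and feed-forward size in the usage lines are placeholder assumptions, not recommendations, and the count ignores the pooler and any task or LM head.

```python
def encoder_param_count(
    vocab_size: int,
    hidden: int,
    layers: int,
    intermediate: int,
    max_positions: int,
    type_vocab: int = 1,
) -> int:
    """Approximate parameter count of a BERT/RoBERTa-style encoder.

    Counts embeddings and transformer layers only (no pooler, no heads).
    """
    layer_norm = 2 * hidden  # scale + bias

    # Token, position and segment-type embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, each with bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: hidden -> intermediate -> hidden, each with bias.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    return embeddings + layers * (attention + feed_forward)


if __name__ == "__main__":
    # Hypothetical small configuration: 128-token sequences plus the two extra
    # position slots that the Hugging Face RoBERTa implementation reserves.
    small = encoder_param_count(
        vocab_size=3000, hidden=256, layers=4, intermediate=1024, max_positions=130
    )
    print(f"small encoder: {small:,} parameters")
```

For comparison, at a hidden size of 768 the token embedding table alone is about 2.3M parameters with a 3000-token vocab versus about 38.6M with the 50265-token `roberta-base` vocab, so with a small vocab most of the budget goes into the layers rather than the embeddings.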

Any opinions welcome! The goal is NER as well as classification (and any other applications I can come up with)
