NER for short technical phrases

I’m looking into training a transformer model on labels from IoT devices. I have 1M training examples, each relatively short (< 128 chars, usually shorter). The text is English, but it’s full of abbreviations and generally doesn’t contain spaces between the “words”. I’ve trained a BPE tokeniser with custom pretokenisation, which works very well, and now I’m trying to decide which model to use.
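
For concreteness, the tokeniser setup looks roughly like this (a simplified sketch using the HuggingFace `tokenizers` library; the digit-splitting pretokeniser and the file name are stand-ins for my actual rules and data):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE over the raw label strings
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

# Stand-in for the custom pretokenisation: split runs of digits off from
# letters, since the labels have no spaces (e.g. "AHU01SupAirTemp")
tokenizer.pre_tokenizer = pre_tokenizers.Digits(individual_digits=False)

trainer = trainers.BpeTrainer(
    vocab_size=3000,  # the size I settled on (see below)
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
tokenizer.train(files=["labels.txt"], trainer=trainer)
tokenizer.save("label_bpe.json")
```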

RoBERTa seems like a good option since it drops NSP, which makes no sense for my data: my “sentences” are more or less independent (they do come in groups, though; one piece of equipment will have a number of sensors).

I’m looking for opinions on which model to use and at what size. I set my tokeniser’s vocabulary size to 3,000; that decision was slightly arbitrary, but I based it on the fact that the effective vocabulary is much smaller than that of, say, a full English corpus.
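
To make the size question concrete, here’s the kind of scaled-down RoBERTa I have in mind for MLM pretraining (every number below is a guess to be tuned, not a settled choice):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=3000,              # matches the BPE tokeniser
    max_position_embeddings=132,  # <=128 tokens + <s>/</s> + RoBERTa's 2-position offset
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # sanity-check the size
```

With such a small vocab and short sequences, the embedding tables are tiny, so almost all of the parameter budget goes into the transformer layers.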

Any opinions welcome! The goal is NER as well as classification (and any other applications I can come up with).
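
Downstream, the idea is that one pretrained encoder feeds both task heads, along these lines (the checkpoint path and label counts are placeholders):

```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaForTokenClassification,
)

# Token-level head for NER over the label's subword tokens
ner_model = RobertaForTokenClassification.from_pretrained(
    "path/to/pretrained-checkpoint", num_labels=9
)

# Sequence-level head for classifying whole labels
cls_model = RobertaForSequenceClassification.from_pretrained(
    "path/to/pretrained-checkpoint", num_labels=20
)
```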