BERT Split NER Labeling

I’m building an NER model with custom labels for medical data. In my dataset, an entity is sometimes split by non-entity words.

For a simple example, say I was designing training data for a person label that looked for names. The sentence “His name was John Smith” would be labeled O, O, O, B-PER, I-PER. That makes sense, but free text gets messy. Imagine situations like this:

John, the man and legend, Smith, will be remembered forever.

Would BERT understand…

B-PER, O, O, O, O, I-PER, O, O, O, O.

See how the split occurred? Both parts should belong to the same entity, but I’m not sure whether I should create IOB data as above or use two separate instances of B-PER. The issue is that I want the model to understand that they are connected.
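
To make the two options concrete, here is a minimal sketch in plain Python (no BERT involved). The `extract_spans` helper is hypothetical, written only for this illustration, and it assumes a strict IOB2 reading in which an I- tag only counts when it directly follows a B- or I- tag with the same label. Punctuation is dropped from the tokens for simplicity.

```python
# Word-level tokens for the example sentence (punctuation dropped for simplicity).
tokens = ["John", "the", "man", "and", "legend", "Smith",
          "will", "be", "remembered", "forever"]

# Option A: continue the entity with I-PER after the gap.
option_a = ["B-PER", "O", "O", "O", "O", "I-PER", "O", "O", "O", "O"]

# Option B: start a second, separate entity with B-PER.
option_b = ["B-PER", "O", "O", "O", "O", "B-PER", "O", "O", "O", "O"]


def extract_spans(tags):
    """Decode contiguous spans from IOB2 tags (hypothetical helper).

    Returns (spans, orphans): spans are (label, start, end_exclusive) tuples,
    orphans are indices of I- tags with no open span of the same label.
    """
    spans, orphans = [], []
    current = None  # (label, start) of the span that is currently open
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            if current is not None:
                spans.append((current[0], current[1], i))
            current = (label, i)
        elif prefix == "I" and current is not None and current[0] == label:
            continue  # the open span simply grows by one token
        else:
            if current is not None:
                spans.append((current[0], current[1], i))
                current = None
            if prefix == "I":
                orphans.append(i)  # I- tag with nothing to attach to
    if current is not None:
        spans.append((current[0], current[1], len(tags)))
    return spans, orphans


print(extract_spans(option_a))  # ([('PER', 0, 1)], [5])
print(extract_spans(option_b))  # ([('PER', 0, 1), ('PER', 5, 6)], [])
```

Under that strict reading, option A leaves “Smith” as an I-PER with nothing to attach to, while option B decodes cleanly into two PER spans but says nothing about them being the same person. Either way, the tags alone do not carry the connection I’m after, which is really the heart of my question.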

I’m playing with Bio_ClinicalBERT and it’s done well for NER. I’m just trying to get it to the next level.

Thanks in advance, and I’d be happy to share more data if needed.


I’ll keep experimenting on my own in the meantime.