I'm building a custom-label NER model for custom medical data. In my dataset, an entity is sometimes split by non-entity words.
For a simple example, say I was designing training data for a person label that looked for names. The sentence “His name was John Smith” would be O, O, O, B-PER, I-PER. That makes sense, but free text gets messy. Imagine situations like this:
John, the man and legend, Smith, will be remembered forever.
Would BERT understand…
B-PER, O, O, O, O, I-PER, O, O, O, O?
See how the split occurred? Both parts should get the same label, but I’m not sure whether I should create IOB data as above or use two separate instances of B-PER. The issue is that I want the model to understand that they are connected.
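To make the two options concrete, here is a minimal sketch of how I understand a strict, contiguous IOB decoder would read each one. The token list (punctuation dropped) and the `decode_spans` helper are my own illustration, not part of any library, and some evaluators treat a stray I- tag after O as the start of a new span rather than dropping it.

```python
# Hypothetical word-level tokens for the example sentence (punctuation dropped).
tokens = ["John", "the", "man", "and", "legend",
          "Smith", "will", "be", "remembered", "forever"]

# Option A: continue the entity with I-PER after the gap.
option_a = ["B-PER", "O", "O", "O", "O", "I-PER", "O", "O", "O", "O"]

# Option B: label the two parts as separate B-PER mentions.
option_b = ["B-PER", "O", "O", "O", "O", "B-PER", "O", "O", "O", "O"]


def decode_spans(tags):
    """Strict IOB2 decoding: a span starts at B- and ends at the first tag
    that is not a matching I-. A stray I- with no open span is ignored."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # still inside the open span
        else:
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


def span_texts(tags):
    """Return the surface text of each decoded span."""
    return [" ".join(tokens[s:e]) for _, s, e in decode_spans(tags)]


print(span_texts(option_a))
print(span_texts(option_b))
```

Under these assumptions, Option A yields only one span ("John") because the I-PER after the gap has no open span to attach to, while Option B yields two unrelated spans ("John" and "Smith"). Neither encoding by itself records that the two parts belong to the same entity, which is the part I'm stuck on.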
I’m playing with Bio-ClinicalBERT and it’s done well for NER. I'm just trying to get it to the next level.
Thanks in advance, and I’d be happy to share more data if needed.