I am working on annotating a dataset for the purpose of named entity recognition.
From what I have seen, multi-word entities (as opposed to single-word ones) are annotated as in the following example:
Romania ( B-CNT )
United States of America ( B-CNT C-CNT C-CNT C-CNT )
where B-CNT stands for “beginning-country” and C-CNT represents “continuing-country”.
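To make the scheme concrete, here is a minimal sketch of how such tags could be decoded back into entities. It assumes tags have the form `B-<label>` / `C-<label>`, that any other tag (for example `O`) means "outside an entity", and that a continuation must directly follow its entity. The function name `decode_spans` is made up for illustration.

```python
def decode_spans(tokens, tags):
    """Group tokens into contiguous entities from B-/C- tags (assumed scheme)."""
    spans = []
    current = None  # (label, [tokens]) for the entity being built, if any
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label:
            # A new entity starts; close the previous one first.
            if current:
                spans.append(current)
            current = (label, [token])
        elif prefix == "C" and label and current and current[0] == label:
            # Continuation of the entity that is currently open.
            current[1].append(token)
        else:
            # Outside tag (or a stray C- tag): close any open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]


print(decode_spans(["United", "States", "of", "America"],
                   ["B-CNT", "C-CNT", "C-CNT", "C-CNT"]))
```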
The problem I face is a case (not related to countries) where the entity is interrupted by a word that does not belong to it, so I would need to annotate it as B-W GAP_WORD C-W C-W.
How should I proceed with the annotation in this case?
If I annotate according to the scheme above, can I expect a BERT-like entity recognition system to learn and detect that a phrase can look like B-W GAP_WORD C-W C-W, or does the "C-W" (continuation word) have to come immediately after the B-W (beginning word)?
Which of the following two solutions is correct?
B-W GAP_WORD C-W C-W
B-W GAP_WORD B-W C-W
In the second case, I would then have to find a way to connect the two B-W segments, which actually belong to the same entity. How could that be done?
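To illustrate what each option would mean at decoding time, here is a rough sketch. It says nothing about what a model will actually learn; it only shows possible post-processing. Both function names and the `max_gap` threshold are hypothetical, and the merging rule for the second option is just one naive heuristic.

```python
def decode_with_gaps(tokens, tags):
    """First option: a C- tag continues the most recent entity with the
    same label, even if gap tokens sit in between (assumed convention)."""
    entities = []        # list of (label, [token indices])
    last_by_label = {}   # most recently opened entity per label
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label:
            entity = (label, [i])
            entities.append(entity)
            last_by_label[label] = entity
        elif prefix == "C" and label in last_by_label:
            last_by_label[label][1].append(i)
    return [(label, [tokens[i] for i in idx]) for label, idx in entities]


def merge_nearby(spans, max_gap=1):
    """Second option: spans are (label, start, end) with end exclusive,
    sorted by start. Same-label spans separated by at most max_gap
    tokens are linked into one entity (a naive heuristic)."""
    merged = []  # list of (label, [(start, end), ...])
    for label, start, end in spans:
        if merged and merged[-1][0] == label:
            prev_end = merged[-1][1][-1][1]
            if start - prev_end <= max_gap:
                merged[-1][1].append((start, end))
                continue
        merged.append((label, [(start, end)]))
    return merged


tokens = ["w1", "gap", "w2", "w3"]
print(decode_with_gaps(tokens, ["B-W", "O", "C-W", "C-W"]))
print(merge_nearby([("W", 0, 1), ("W", 2, 4)]))
```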
Did you find a solution to this problem? I am working on this right now and want to label multi-word entities. So far I have just labelled each word individually, but that is a pretty bad way to do it.