Annotate a NER dataset (for BERT)

I am working on annotating a dataset for the purpose of named entity recognition.

In principle, I have seen that for multi-word (not single-word) entities, annotations work like this:

  1. Romania ( B-CNT )
  2. United States of America ( B-CNT C-CNT C-CNT C-CNT )

where B-CNT stands for “beginning-country” and C-CNT represents “continuing-country”.
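For reference, the scheme above can be sketched in a few lines of Python. This is only an illustration: the helper name `tag_tokens`, the token-index span format, and the "O" tag for words outside any entity are assumptions borrowed from common BIO-style tagging, not something defined in this post.

```python
def tag_tokens(tokens, spans):
    """Assign B-/C- tags to tokens.

    spans: list of (start, end_exclusive, label) token-index triples.
    Tokens outside every span get "O" (an assumed "outside" tag).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"C-{label}"          # remaining words of the entity
    return tags


tokens = ["I", "visited", "the", "United", "States", "of", "America"]
tags = tag_tokens(tokens, [(3, 7, "CNT")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each span produces exactly one B- tag followed by zero or more C- tags, so a contiguous entity can always be recovered from the tag sequence.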

The problem is that I have a case (not related to countries) where I need to annotate like B-W GAP_WORD C-W C-W .

How should I proceed with the annotation in this case?

If I annotate according to the scheme above, should I expect a BERT-like entity recognition system to learn and detect that a phrase can look like B-W GAP_WORD C-W C-W , or does the “C-W” (continuation word) need to come directly after the B-W (beginning word)?

Which of the following two solutions is correct?

  1. B-W GAP_WORD C-W C-W
  2. B-W GAP_WORD B-W C-W

And then, in case 2, should I find a way to connect the B-Ws (which actually belong to the same entity)?
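One possible way to make that connection in case 2 is a post-processing step: decode the tags into contiguous fragments, then link fragments that share a label and are separated by only a few words. The sketch below is a heuristic under stated assumptions, not a confirmed method: the function names, the "O" tag for the gap word, and the `max_gap` threshold are all hypothetical.

```python
def decode_fragments(tags):
    """Return contiguous (start, end_exclusive, label) fragments from B-/C- tags."""
    fragments = []
    current = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                fragments.append(tuple(current))
            current = [i, i + 1, tag[2:]]       # open a new fragment
        elif tag.startswith("C-") and current and current[2] == tag[2:]:
            current[1] = i + 1                  # extend the open fragment
        else:
            if current:
                fragments.append(tuple(current))
            current = None                      # "O" or a stray C- tag closes it
    if current:
        fragments.append(tuple(current))
    return fragments


def link_fragments(fragments, max_gap=1):
    """Group same-label fragments separated by 1..max_gap tokens (assumed heuristic)."""
    entities = []
    for frag in fragments:
        if entities:
            last = entities[-1][-1]
            gap = frag[0] - last[1]
            if frag[2] == last[2] and 0 < gap <= max_gap:
                entities[-1].append(frag)
                continue
        entities.append([frag])
    return entities


# B-W GAP_WORD B-W C-W, with the gap word tagged "O" (assumption)
tags = ["B-W", "O", "B-W", "C-W"]
fragments = decode_fragments(tags)
entities = link_fragments(fragments)
print(entities)
```

Directly adjacent fragments (gap of zero) are deliberately kept apart, since two B- tags in a row normally mark two distinct entities. Whether a fixed gap threshold is good enough depends on the data; it would also link unrelated entities that happen to sit one word apart.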

Did you find a solution to this problem? I am working on this right now and want to label multi-word entities. So far I have just labelled them all as individual words, but it's a pretty bad way to do this.

Hey TG1. I am really sorry to see your message only now. I went forward with the second approach and had decent results at that time.

Hi @Calin, hello @TG1

It seems our problems are very similar. Look:

What I want to achieve: a model that recognizes places like these:

   Syracuse, NY
   Athens, United States

as ONE entity. Well, Athens, United States ( B-CNT C-CNT C-CNT ) would be fine too.

Very similar to your problem, isn’t it?