Apologies in advance if this question has already been asked; I have not been able to find a convincing answer or an optimal way to deal with this issue.
To my understanding, NER models make predictions at the token level. Since BERT uses a sub-word tokenizer, it is entirely possible that part of a word is left unlabeled, or that different sub-words of the same word receive different labels. Both outcomes are undesirable, because in the end we want NER at the word level, not the token level.
For example, see link.
Barr(PER) ien(O) tos(PER)
This should have been Barr(PER) ien(PER) tos(PER).
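For context, the underlying split is easy to reproduce. This is just a quick sketch; "bert-base-cased" is only an example checkpoint, and the exact sub-word pieces depend on the vocabulary of the model actually used:

```python
from transformers import AutoTokenizer

# Example checkpoint; the exact sub-word pieces depend on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Barrientos"))
# Prints several sub-word pieces, each of which gets its own NER label.
```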
Here is another, even more confusing prediction:
F(MISC) abric (O) … Fat (LOC) pack Sweater (ORG)
Again we have inconsistent token-level predictions within the same word.
So my questions are the following:
- How can I best convert token-level NER labels to word-level labels? What is the best policy for dealing with inconsistent token-level predictions within the same word? Is there a standard way to do this, and has it already been implemented in the Hugging Face library? (A naive sketch of the kind of aggregation I mean is included after these questions.)
- Shouldn't there be a way to force the model to recognize that those tokens come from the same word, so that they receive the same token-level label in the first place?
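To make the first question concrete, below is the naive aggregation I have been considering. It is only a sketch, assuming a fast tokenizer (so that `word_ids()` is available); the checkpoint name `dslim/bert-base-NER` and the "first sub-token wins" policy are placeholders, not a claim about what the standard approach is:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dslim/bert-base-NER"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Barrientos lives in New York"
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits[0]          # shape: (num_tokens, num_labels)
pred_ids = logits.argmax(dim=-1).tolist()

# word_ids() maps every sub-word token back to the word it came from
# (None for special tokens such as [CLS] and [SEP]).
word_ids = enc.word_ids(batch_index=0)

# Naive policy: the label of the FIRST sub-word token becomes the word label.
# (Alternatives: majority vote over sub-words, or the highest-scoring label.)
word_labels = {}
for token_idx, word_idx in enumerate(word_ids):
    if word_idx is None or word_idx in word_labels:
        continue
    word_labels[word_idx] = model.config.id2label[pred_ids[token_idx]]

# Crude word boundaries for printing; proper spans would come from offsets.
for i, word in enumerate(text.split()):
    print(word, word_labels.get(i, "O"))
```

I am aware that the token-classification pipeline accepts an `aggregation_strategy` argument (e.g. "first", "max", "average"), which looks related, but I am not sure which of these policies, if any, is considered the standard one.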
Any suggestions are greatly appreciated. Thank you!