Hi everyone. I annotated a corpus with emails and websites and fine-tuned BERT to recognize them, but at inference time the results come back split: for example, if the sentence contains www.example.com, I get 5 entities of type website (www, ., example, ., com). How can I solve this?
Models like BERT are so-called subword token models, which means that they operate on the (subword) token level rather than the word level. A word like “hello” might get tokenized into multiple tokens, “hel” and “lo”, and BERT is trained to predict a label for each individual token.
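You can see this splitting directly by tokenizing the problematic string yourself. A minimal sketch, assuming a bert-base-cased checkpoint (any BERT tokenizer shows the same behavior; the exact token split may vary by vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("Visit www.example.com for details")
print(tokens)
# e.g. ['Visit', 'www', '.', 'example', '.', 'com', 'for', 'details']
# The URL is split into several tokens, and the model predicts one label
# per token -- hence the five separate "website" entities you are seeing.
```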
Hence, after training, one needs to recombine the labels that BERT predicted at the token level back to the word level. Refer to my demo notebook, which includes an inference section at the end; a quick alternative is sketched below.
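If you don't want to do the recombination by hand, the token-classification pipeline can aggregate subword predictions into entity-level spans via its aggregation_strategy argument. A minimal sketch, where "path/to/your-model" is a hypothetical path to your fine-tuned checkpoint:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/your-model",  # hypothetical: your fine-tuned BERT
    aggregation_strategy="simple",  # groups adjacent tokens of the same entity type
)

print(ner("Contact us at www.example.com"))
# With consistent B-/I- predictions, this yields one merged entity
# spanning the whole URL instead of five fragments.
```

How well "simple" merges depends on how cleanly your model predicts B-/I- tags across the URL's tokens; "first", "average", and "max" are stricter strategies that resolve disagreements within a word.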
Basically, we can leverage the tokenizer: fast tokenizers can return an offset_mapping (via return_offsets_mapping=True), which contains the character offsets of all the tokens, so token-level predictions can be mapped back to spans of the original text. This section also explains it in more detail: Fast tokenizers’ special powers - Hugging Face NLP Course
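Here is a minimal sketch of that recombination step. The model path is hypothetical, and the B-/I- label scheme (e.g. B-WEBSITE, I-WEBSITE) is an assumption; adapt it to your own label set:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "path/to/your-model"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Contact us at www.example.com"
encoding = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = encoding.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    logits = model(**encoding).logits
predictions = logits.argmax(dim=-1)[0].tolist()

# Merge consecutive non-O tokens of the same entity type into one span.
# Note: two distinct adjacent entities of the same type would also be
# merged; check for B- prefixes if you need to keep them apart.
entities = []
current = None
for (start, end), pred in zip(offsets, predictions):
    if start == end:  # special tokens like [CLS]/[SEP] have empty offsets
        continue
    label = model.config.id2label[pred]
    if label == "O":
        current = None
        continue
    entity_type = label.split("-")[-1]
    if current is not None and current["type"] == entity_type:
        current["end"] = end  # extend the running entity
    else:
        current = {"type": entity_type, "start": start, "end": end}
        entities.append(current)

for ent in entities:
    print(ent["type"], text[ent["start"]:ent["end"]])
# Expected (assuming the model tags all 5 tokens): WEBSITE www.example.com
```

The key idea is that offset_mapping lets you slice the entity text out of the original string by character positions, so the subword splitting becomes invisible in the final output.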