Hi everyone. I annotated a corpus with emails and websites and fine-tuned BERT to recognize them, but at inference time the results come back split: for example, if the sentence contains www.example.com, I get 5 entities of type website (www, ., example, ., com). How can I solve this?
Models like BERT are so-called subword token models, which means that they operate on the (subword) token level rather than the word level. A word like “hello” might get tokenized into multiple tokens, “hel” and “lo”, and BERT is trained to predict a label for each individual token.
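You can see this splitting directly by tokenizing the problematic string yourself. A minimal sketch, assuming a bert-base-cased checkpoint (any BERT tokenizer shows the same behavior; the exact token split may vary by vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("Visit www.example.com for details")
print(tokens)
# e.g. ['Visit', 'www', '.', 'example', '.', 'com', 'for', 'details']
# The URL is split into several tokens, and the model predicts one label
# per token -- hence the five separate "website" entities you are seeing.
```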
Hence, after training, one needs to recombine the labels that BERT predicted at the token level back to the word level. Refer to my demo notebook, which includes an inference section at the end; a quick alternative is sketched below.
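If you don't want to do the recombination by hand, the token-classification pipeline can aggregate subword predictions into entity-level spans via its aggregation_strategy argument. A minimal sketch, where "path/to/your-model" is a hypothetical path to your fine-tuned checkpoint:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/your-model",  # hypothetical: your fine-tuned BERT
    aggregation_strategy="simple",  # groups adjacent tokens of the same entity type
)

print(ner("Contact us at www.example.com"))
# With consistent B-/I- predictions, this yields one merged entity
# spanning the whole URL instead of five fragments.
```

How well "simple" merges depends on how cleanly your model predicts B-/I- tags across the URL's tokens; "first", "average", and "max" are stricter strategies that resolve disagreements within a word.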
Basically, we can leverage the tokenizer: fast tokenizers can return an offset_mapping (via return_offsets_mapping=True), which contains the character offsets of all the tokens, so token-level predictions can be mapped back to spans of the original text. This section also explains it in more detail: Fast tokenizers’ special powers - Hugging Face NLP Course
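Here is a minimal sketch of that recombination step. The model path is hypothetical, and the B-/I- label scheme (e.g. B-WEBSITE, I-WEBSITE) is an assumption; adapt it to your own label set:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "path/to/your-model"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Contact us at www.example.com"
encoding = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = encoding.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    logits = model(**encoding).logits
predictions = logits.argmax(dim=-1)[0].tolist()

# Merge consecutive non-O tokens of the same entity type into one span.
# Note: two distinct adjacent entities of the same type would also be
# merged; check for B- prefixes if you need to keep them apart.
entities = []
current = None
for (start, end), pred in zip(offsets, predictions):
    if start == end:  # special tokens like [CLS]/[SEP] have empty offsets
        continue
    label = model.config.id2label[pred]
    if label == "O":
        current = None
        continue
    entity_type = label.split("-")[-1]
    if current is not None and current["type"] == entity_type:
        current["end"] = end  # extend the running entity
    else:
        current = {"type": entity_type, "start": start, "end": end}
        entities.append(current)

for ent in entities:
    print(ent["type"], text[ent["start"]:ent["end"]])
# Expected (assuming the model tags all 5 tokens): WEBSITE www.example.com
```

The key idea is that offset_mapping lets you slice the entity text out of the original string by character positions, so the subword splitting becomes invisible in the final output.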