NER model fine-tuning with labeled spans


I’m looking to fine-tune an NER model (dslim/bert-base-NER-uncased) with my own data.

My annotations look like this: for each example I have a piece of raw text (str) and a list of annotated spans, each of the form {start_index: int, end_index: int, tag: str}

However, to fine-tune the NER model, I need to prepare X (tokens) and Y (token tags) for each example. So those spans have to be translated into token-level tags that match the model’s tokenizer.

Hope that makes sense.

Is there a way to handle this? Or, what would you recommend?


Hi folks! Does this make sense?


Yes, that’s the right approach. Take a look at spaCy’s offsets_to_biluo_tags function. It converts character-level annotations into token-level annotations (in the BILOU format, which is a bit more exotic than IOB).

Thank you @nielsr!

The problem is that offsets_to_biluo_tags operates on a spaCy Doc, so it relies on spaCy’s tokenizer, which is different from the Hugging Face BERT tokenizer. So don’t I need something that uses HF’s BERT tokenizer to go from spans to token labels?
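One option is to skip spaCy entirely: fast Hugging Face tokenizers can return the character offsets of every token via `return_offsets_mapping=True`, so you can align your character-level spans directly against the model's own tokenization. Below is a minimal sketch; the function name `spans_to_iob` and the hard-coded offsets are mine, and it assumes `end_index` is exclusive. In practice the offsets would come from `tokenizer(text, return_offsets_mapping=True)["offset_mapping"]`.

```python
def spans_to_iob(offsets, spans):
    """Map character-level spans onto IOB tags, one per token.

    offsets: list of (char_start, char_end) per token, e.g. from a fast HF
             tokenizer called with return_offsets_mapping=True.
    spans:   list of dicts {"start_index", "end_index", "tag"},
             with end_index exclusive (an assumption on my part).
    """
    tags = ["O"] * len(offsets)
    for span in spans:
        s, e, label = span["start_index"], span["end_index"], span["tag"]
        first = True
        for i, (tok_s, tok_e) in enumerate(offsets):
            if tok_s == tok_e:
                continue  # special tokens like [CLS]/[SEP] get (0, 0)
            if tok_s < e and tok_e > s:  # token overlaps this span
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Hypothetical offsets for "tom lives in Los Angeles",
# with (0, 0) entries standing in for [CLS] and [SEP]:
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (13, 16), (17, 24), (0, 0)]
spans = [
    {"start_index": 0, "end_index": 3, "tag": "PER"},
    {"start_index": 13, "end_index": 24, "tag": "LOC"},
]
print(spans_to_iob(offsets, spans))
# ['O', 'B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O']
```

Because the alignment is done against the tokenizer's own offsets, any subword splitting the BERT tokenizer does is handled for free: each subword piece inside a span overlaps it and gets tagged.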

You can convert the spans to BIOES tags (other tag schemes are similar) like this:

def convert_ents_to_bio(text, ents):
    # text is a list of tokens; ent indices are token-level and inclusive
    tags = ["O"] * len(text)
    for ent in ents:
        start, end, label = ent["start"], ent["end"], ent["label"]
        if start == end:
            tags[start] = "S-" + label  # single-token entity
        else:
            tags[start] = "B-" + label
            tags[start + 1 : end + 1] = ["I-" + label] * (end - start)
    return tags

text = ["tom", "lives", "in", "Los", "Angeles"]
ents = [
    {"start": 0, "end": 0, "label": "PERSON"},
    {"start": 3, "end": 4, "label": "ADDRESS"},
]
convert_ents_to_bio(text, ents)
# ['S-PERSON', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS']
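Once you have word-level tags like these, you still need to spread them over the BERT tokenizer's subword pieces. Fast tokenizers expose `word_ids()` for this when the input is pre-split and passed with `is_split_into_words=True`. Here's a sketch of the usual alignment (the helper name and the hard-coded `word_ids` list are mine, standing in for the real tokenizer output): label only the first subword of each word and mask everything else with -100 so it's ignored by the loss.

```python
def align_tags_to_subwords(word_tags, word_ids, label2id):
    """Expand word-level tags to one label id per subword token.

    word_ids: one entry per subword token, e.g. from
              tokenizer(tokens, is_split_into_words=True).word_ids();
              None marks special tokens, repeats mark continuation pieces.
    """
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            labels.append(-100)  # -100 is ignored by the cross-entropy loss
        else:
            labels.append(label2id[word_tags[wid]])
        previous = wid
    return labels


word_tags = ["S-PERSON", "O", "O", "B-ADDRESS", "I-ADDRESS"]
# Made-up word_ids: say "tom" splits into two pieces, plus [CLS]/[SEP]:
word_ids = [None, 0, 0, 1, 2, 3, 4, None]
label2id = {"O": 0, "S-PERSON": 1, "B-ADDRESS": 2, "I-ADDRESS": 3}
print(align_tags_to_subwords(word_tags, word_ids, label2id))
# [-100, 1, -100, 0, 0, 2, 3, -100]
```

Note that the B-/I-/S- prefixes stay on the first subword only; an alternative convention labels continuation pieces with the I- tag instead of -100, which is a design choice rather than a requirement.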