NER model fine-tuning with labeled spans

Hi!

I’m looking to fine-tune an NER model (dslim/bert-base-NER-uncased) with my own data.

My annotations look like this: for each example I have a piece of raw text (str) and a list of annotated spans, each of the form {start_index: int, end_index: int, tag: str}.

However, to fine-tune the NER model, I need to prepare X (tokens) and Y (token tags) for each example. So, those spans have to be translated into token tags, matching the model’s tokenizer.

Hope that makes sense.

Is there a way to handle this? Or, what would you recommend?

Thanks!

Hi folks! Does this make sense?

Hi,

Yes, that’s the right approach. You can take a look at spaCy’s offsets_to_biluo_tags function. It’s great for converting character-index-level annotations to token-level annotations (in BILOU format, which is a bit more exotic than IOB).
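
Since the reply notes that BILOU is less common than IOB, here is a minimal sketch of mapping BILOU-style tags down to IOB2 once you have them. The function name biluo_to_iob2 is hypothetical (not part of spaCy or transformers), and it assumes tags shaped like PREFIX-LABEL; anything else is passed through unchanged.

```python
def biluo_to_iob2(tags):
    """Map BILOU/BILUO tags to IOB2: U-X becomes B-X and L-X becomes I-X."""
    prefix_map = {"U": "B", "L": "I"}
    converted = []
    for tag in tags:
        parts = tag.split("-", 1)
        # "O" and any tag without a PREFIX-LABEL shape are kept as they are.
        if len(parts) != 2 or not parts[0] or not parts[1]:
            converted.append(tag)
            continue
        prefix, label = parts
        # B- and I- prefixes are already valid IOB2 and stay unchanged.
        converted.append(prefix_map.get(prefix, prefix) + "-" + label)
    return converted


print(biluo_to_iob2(["U-PER", "O", "O", "B-LOC", "L-LOC"]))
# ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC']
```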

Thank you @nielsr!

The problem is that offsets_to_biluo_tags relies on spaCy’s tokenizer, right? That tokenizer differs from, for example, the Hugging Face BERT tokenizer. So don’t I need something that uses HF’s BERT tokenizer to go from spans to token labels?
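
One way to stay with the model’s own tokenizer is to use the character offsets it reports for each token. Below is a minimal sketch, not a definitive implementation. It assumes a fast Hugging Face tokenizer called with return_offsets_mapping=True, that end_index is exclusive, and that spans do not overlap. The function name spans_to_token_labels is hypothetical, and the offsets in the example are hand-written stand-ins for tokenizer output so the snippet runs without downloading a model.

```python
def spans_to_token_labels(offsets, spans):
    """Turn character-level spans into one IOB label per token.

    offsets: list of (char_start, char_end) pairs, one per token.
    spans:   list of dicts with start_index, end_index (assumed exclusive), tag.
    """
    # Tokens with an empty offset such as (0, 0) are special tokens ([CLS], [SEP]).
    # They get None here; in training they are commonly mapped to -100
    # so that the loss ignores them.
    labels = [None if start == end else "O" for start, end in offsets]
    for span in spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            if labels[i] is None:
                continue
            # A token belongs to the span if the two character ranges overlap.
            if tok_start < span["end_index"] and tok_end > span["start_index"]:
                labels[i] = ("I-" if inside else "B-") + span["tag"]
                inside = True
    return labels


# With a real tokenizer the offsets would come from something like:
#   enc = tokenizer(text, return_offsets_mapping=True)
#   offsets = enc["offset_mapping"]
text = "tom lives in Los Angeles"
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (13, 16), (17, 24), (0, 0)]
spans = [
    {"start_index": 0, "end_index": 3, "tag": "PER"},
    {"start_index": 13, "end_index": 24, "tag": "LOC"},
]
print(spans_to_token_labels(offsets, spans))
# [None, 'B-PER', 'O', 'O', 'B-LOC', 'I-LOC', None]
```

Because the overlap test works on whatever offsets the tokenizer returns, word pieces inside an entity simply receive I- labels, and swapping tokenizers does not require changing the annotations.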

You can convert the spans to BIOS tags (other tag schemes are similar) like this:

def convert_ents_to_bio(text, ents):
    # text is a list of tokens; each ent uses token indices, with an inclusive end
    tags = ["O"] * len(text)
    for ent in ents:
        start, end, label = ent["start"], ent["end"], ent["label"]
        if start == end:
            # single-token entity
            tags[start] = "S-" + label
        else:
            # first token gets B-, the remaining tokens up to and including end get I-
            tags[start] = "B-" + label
            tags[start + 1 : end + 1] = ["I-" + label] * (end - start)
    return tags

input:
text = ["tom", "lives", "in", "Los", "Angeles"]
ents = [
    {"start": 0, "end": 0, "label": "PERSON"},
    {"start": 3, "end": 4, "label": "ADDRESS"},
]
convert_ents_to_bio(text, ents)
output:
['S-PERSON', 'O', 'O', 'B-ADDRESS', 'I-ADDRESS']
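
The function above expects token indices, while the original question has character offsets. Here is a minimal sketch of bridging the two for whitespace-separated words; it does not handle subword tokenizers. It assumes end_index is exclusive, and the helper name char_spans_to_word_ents is hypothetical.

```python
import re


def char_spans_to_word_ents(text, spans):
    """Split text on whitespace and re-express character spans as word-index spans."""
    words, offsets = [], []
    for match in re.finditer(r"\S+", text):
        words.append(match.group())
        offsets.append((match.start(), match.end()))

    ents = []
    for span in spans:
        # Indices of every word whose character range overlaps the span.
        covered = [
            i for i, (start, end) in enumerate(offsets)
            if start < span["end_index"] and end > span["start_index"]
        ]
        if covered:
            # "end" is inclusive, matching the ents format used above.
            ents.append({"start": covered[0], "end": covered[-1], "label": span["tag"]})
    return words, ents


words, ents = char_spans_to_word_ents(
    "tom lives in Los Angeles",
    [
        {"start_index": 0, "end_index": 3, "tag": "PERSON"},
        {"start_index": 13, "end_index": 24, "tag": "ADDRESS"},
    ],
)
print(words)  # ['tom', 'lives', 'in', 'Los', 'Angeles']
print(ents)
# [{'start': 0, 'end': 0, 'label': 'PERSON'}, {'start': 3, 'end': 4, 'label': 'ADDRESS'}]
```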

Hi @astarostap, I understand your issue! I struggled with the same thing: I needed the tokens and tags to align with my chosen tokenizer, which was subject to change.

I created a Python package to handle this because I couldn’t find anything like it online: GitHub - Ben-Epstein/spacy-to-hf: A simple converter from SpaCy Entities (Spans) to Huggingface BILOU formatted data (tokens and ner_tags)

I hope it’s helpful!