Hello,
I have fine-tuned a custom NER model, and when doing inference, I would like a way to convert from token-level predictions back to the offsets of the original string.
Take the example below:

```python
from transformers import pipeline

ner = pipeline("ner", model='dbmdz/bert-large-cased-finetuned-conll03-english')
print(ner("I went to Rome yesterday to watch Rome vs Bayer Munich"))
```
I get this output:

```python
[{'entity': 'I-LOC', 'score': 0.99968606, 'index': 4, 'word': 'Rome', 'start': 10, 'end': 14},
 {'entity': 'I-ORG', 'score': 0.9314538, 'index': 8, 'word': 'Rome', 'start': 32, 'end': 36},
 {'entity': 'I-ORG', 'score': 0.99915195, 'index': 10, 'word': 'Bay', 'start': 40, 'end': 43},
 {'entity': 'I-ORG', 'score': 0.99730355, 'index': 11, 'word': '##er', 'start': 43, 'end': 45},
 {'entity': 'I-ORG', 'score': 0.9979789, 'index': 12, 'word': 'Munich', 'start': 46, 'end': 52}]
```
As you can see, the entities are split into tokens (e.g. “Bayer Munich” spans 3 tokens). What I’d like to get instead is something like this:
```json
[
  {
    "entity_group": "LOC",
    "score": 0.9996861219406128,
    "word": "Rome",
    "start": 10,
    "end": 14
  },
  {
    "entity_group": "ORG",
    "score": 0.9314539432525635,
    "word": "Rome",
    "start": 32,
    "end": 36
  },
  {
    "entity_group": "ORG",
    "score": 0.9981448650360107,
    "word": "Bayer Munich",
    "start": 40,
    "end": 52
  }
]
```
Is there a clean way to achieve this, or do I have to write a function to do this myself?
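In case it helps clarify what I mean, if I had to write it myself I imagine the grouping would look roughly like the naive sketch below (my own attempt, not library code): it merges adjacent tokens that share a label, strips the `##` subword markers, keeps the `start` of the first token and the `end` of the last, and averages the scores. It doesn't handle B-/I- boundaries between adjacent same-label entities.

```python
def group_entities(predictions):
    """Merge token-level NER predictions back into entity-level spans
    whose start/end offsets index into the original string."""
    groups = []
    for tok in predictions:
        label = tok["entity"].split("-")[-1]  # drop the B-/I- prefix
        prev = groups[-1] if groups else None
        # Extend the previous group when the label matches and this token
        # immediately follows it in the tokenized sequence.
        if prev and prev["entity_group"] == label and tok["index"] == prev["_idx"] + 1:
            if tok["word"].startswith("##"):       # subword continuation
                prev["word"] += tok["word"][2:]
            else:                                   # new whole word
                prev["word"] += " " + tok["word"]
            prev["end"] = tok["end"]
            prev["_scores"].append(tok["score"])
            prev["_idx"] = tok["index"]
        else:
            groups.append({
                "entity_group": label,
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
                "_scores": [tok["score"]],
                "_idx": tok["index"],
            })
    # Average the token scores and drop the bookkeeping fields.
    for g in groups:
        scores = g.pop("_scores")
        g.pop("_idx")
        g["score"] = sum(scores) / len(scores)
    return groups
```

But this feels like reinventing the wheel, so I'd much rather use something built in if it exists.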
Please give answers that also apply to custom models not available on the Hugging Face Hub, as ultimately my pipeline would be something like:
```python
ner = pipeline("ner", model='my_trained_model', tokenizer=AutoTokenizer.from_pretrained("..."))
```

where `my_trained_model` is the path to the folder containing my model trained with `transformers.Trainer`.