Hello,
I have fine-tuned a custom NER model, and when doing inference, I would like a way to convert from token-level predictions back to the offsets of the original string.
Take the example below:

```python
from transformers import pipeline

ner = pipeline("ner", model='dbmdz/bert-large-cased-finetuned-conll03-english')
print(ner("I went to Rome yesterday to watch Rome vs Bayer Munich"))
```
I get this output:

```python
[{'entity': 'I-LOC', 'score': 0.99968606, 'index': 4, 'word': 'Rome', 'start': 10, 'end': 14},
 {'entity': 'I-ORG', 'score': 0.9314538, 'index': 8, 'word': 'Rome', 'start': 32, 'end': 36},
 {'entity': 'I-ORG', 'score': 0.99915195, 'index': 10, 'word': 'Bay', 'start': 40, 'end': 43},
 {'entity': 'I-ORG', 'score': 0.99730355, 'index': 11, 'word': '##er', 'start': 43, 'end': 45},
 {'entity': 'I-ORG', 'score': 0.9979789, 'index': 12, 'word': 'Munich', 'start': 46, 'end': 52}]
```
As you can see, the entities are split into tokens (e.g. “Bayer Munich” spans 3 tokens). What I’d like to get instead is something like this:
```json
[
  {
    "entity_group": "LOC",
    "score": 0.9996861219406128,
    "word": "Rome",
    "start": 10,
    "end": 14
  },
  {
    "entity_group": "ORG",
    "score": 0.9314539432525635,
    "word": "Rome",
    "start": 32,
    "end": 36
  },
  {
    "entity_group": "ORG",
    "score": 0.9981448650360107,
    "word": "Bayer Munich",
    "start": 40,
    "end": 52
  }
]
```
Is there a clean way to achieve this, or do I have to write a function to do this myself?
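In case it helps clarify what I mean, if I had to write it myself I imagine the grouping would look roughly like the naive sketch below (my own attempt, not library code): it merges adjacent tokens that share a label, strips the `##` subword markers, keeps the `start` of the first token and the `end` of the last, and averages the scores. It doesn't handle B-/I- boundaries between adjacent same-label entities.

```python
def group_entities(predictions):
    """Merge token-level NER predictions back into entity-level spans
    whose start/end offsets index into the original string."""
    groups = []
    for tok in predictions:
        label = tok["entity"].split("-")[-1]  # drop the B-/I- prefix
        prev = groups[-1] if groups else None
        # Extend the previous group when the label matches and this token
        # immediately follows it in the tokenized sequence.
        if prev and prev["entity_group"] == label and tok["index"] == prev["_idx"] + 1:
            if tok["word"].startswith("##"):       # subword continuation
                prev["word"] += tok["word"][2:]
            else:                                   # new whole word
                prev["word"] += " " + tok["word"]
            prev["end"] = tok["end"]
            prev["_scores"].append(tok["score"])
            prev["_idx"] = tok["index"]
        else:
            groups.append({
                "entity_group": label,
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
                "_scores": [tok["score"]],
                "_idx": tok["index"],
            })
    # Average the token scores and drop the bookkeeping fields.
    for g in groups:
        scores = g.pop("_scores")
        g.pop("_idx")
        g["score"] = sum(scores) / len(scores)
    return groups
```

But this feels like reinventing the wheel, so I'd much rather use something built in if it exists.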
Please give answers that also apply to custom models not available on the Hugging Face Hub, as ultimately my pipeline would be something like:
```python
ner = pipeline("ner", model='my_trained_model', tokenizer=AutoTokenizer.from_pretrained("..."))
```

where `my_trained_model` is the path to the folder containing my model trained with `transformers.Trainer`.