Feature request
I have fine tuned a Hugging face
token classification
model for NER task
. I use pipeline
from Hugging face to do prediction on test text data.
I tag the data as BIOL
format. B stands of Beginning, I stand for Including, O means no entity, L means Last
Example:
Joh J Mathew
will be tagged as B_PERSON
I_PERSON
L_PERSON
Here is how the output looks like:
model = AutoModelForTokenClassification.from_pretrained("model_x")
tokenizer = AutoTokenizer.from_pretrained("model_x")
token_classifier = pipeline("token-classification", model=model, aggregation_strategy="max",tokenizer=tokenizer)
text=("""'IOWA DRIVER LICENSE 1 SAMPLE 2 MARK LIMITED-TERM 8 123 NORTH STREET APT 201 DES MOINES, IA 50301-1234
Onom d DL No. 123XX6789 4a iss 1107/2016 4b exp 01/12/2021 15 Sex M 16 Hgt 5\'-08" 18 Eyes BRO 9a End NONE 9
Class C 12 Rest NONE Mark Sample DONOR MED ALERT: Y HEARING IMP: Y MED ADV DIR: Y 3 OOB 01/12/1967 5
DD 12345678901234567890123 NIVIA AL NA LANG ---- QUE EROL DE USA 01/12/67""")
for ent in token_classifier(text):
print(ent)
{'entity_group': 'B_LAST_NAME', 'score': 0.9999994, 'word': 'SAMPLE', 'start': 23, 'end': 29}
{'entity_group': 'B_FIRST_NAME', 'score': 0.99999905, 'word': '', 'start': 32, 'end': 33}
{'entity_group': 'L_FIRST_NAME', 'score': 0.9999949, 'word': 'MARK', 'start': 32, 'end': 36}
{'entity_group': 'B_ADDRESS', 'score': 0.9999989, 'word': '123', 'start': 52, 'end': 55}
{'entity_group': 'I_ADDRESS', 'score': 0.99999917, 'word': 'NORTHSTREETAPT201DESMOINES,IA', 'start': 56, 'end': 91}
{'entity_group': 'I_DRIVER_LICENSE_NUMBER', 'score': 0.9999995, 'word': '123XX6789', 'start': 118, 'end': 127}
{'entity_group': 'L_ISSUE_DATE', 'score': 0.99999964, 'word': '1107/2016', 'start': 135, 'end': 144}
{'entity_group': 'I_EXPIRY_DATE', 'score': 0.99999964, 'word': '01/12/2021', 'start': 152, 'end': 162}
{'entity_group': 'B_PERSON_NAME', 'score': 0.99999905, 'word': 'Mark', 'start': 234, 'end': 238}
{'entity_group': 'I_PERSON_NAME', 'score': 0.9999993, 'word': 'Sample', 'start': 239, 'end': 245}
{'entity_group': 'L_DATE_OF_BIRTH', 'score': 0.99999976, 'word': '01/12/1967', 'start': 301, 'end': 311}
So, given the offset values entity_group
, word
, start
, end
how can I highlight the original text with the entity_group
so that it is easy to visulize.
Final Output
Is there any python-library that I can use to do it.?
Motivation
Makes easy to visualise the NER output
Your contribution
NA