Feature to highlight or color code the text from the NER output of token classification having offsets using python

pchhapolika · August 31, 2022, 11:32am

Feature request

I have fine tuned a Hugging face token classification model for NER task. I use pipeline from Hugging face to do prediction on test text data.

I tag the data as BIOL format. B stands of Beginning, I stand for Including, O means no entity, L means Last

Example:

Joh J Mathew will be tagged as B_PERSON I_PERSON L_PERSON

Here is how the output looks like:

model = AutoModelForTokenClassification.from_pretrained("model_x")
tokenizer = AutoTokenizer.from_pretrained("model_x")
    
token_classifier = pipeline("token-classification", model=model, aggregation_strategy="max",tokenizer=tokenizer)

text=("""'IOWA DRIVER LICENSE 1 SAMPLE 2 MARK LIMITED-TERM 8 123 NORTH STREET APT 201 DES MOINES, IA 50301-1234 
Onom d DL No. 123XX6789 4a iss 1107/2016 4b exp 01/12/2021 15 Sex M 16 Hgt 5\'-08" 18 Eyes BRO 9a End NONE 9 
Class C 12 Rest NONE Mark Sample DONOR MED ALERT: Y HEARING IMP: Y MED ADV DIR: Y 3 OOB 01/12/1967 5 
DD 12345678901234567890123 NIVIA AL NA LANG ---- QUE EROL DE USA 01/12/67""")

for ent in token_classifier(text):
    print(ent)

{'entity_group': 'B_LAST_NAME', 'score': 0.9999994, 'word': 'SAMPLE', 'start': 23, 'end': 29}
{'entity_group': 'B_FIRST_NAME', 'score': 0.99999905, 'word': '', 'start': 32, 'end': 33}
{'entity_group': 'L_FIRST_NAME', 'score': 0.9999949, 'word': 'MARK', 'start': 32, 'end': 36}
{'entity_group': 'B_ADDRESS', 'score': 0.9999989, 'word': '123', 'start': 52, 'end': 55}
{'entity_group': 'I_ADDRESS', 'score': 0.99999917, 'word': 'NORTHSTREETAPT201DESMOINES,IA', 'start': 56, 'end': 91}
{'entity_group': 'I_DRIVER_LICENSE_NUMBER', 'score': 0.9999995, 'word': '123XX6789', 'start': 118, 'end': 127}
{'entity_group': 'L_ISSUE_DATE', 'score': 0.99999964, 'word': '1107/2016', 'start': 135, 'end': 144}
{'entity_group': 'I_EXPIRY_DATE', 'score': 0.99999964, 'word': '01/12/2021', 'start': 152, 'end': 162}
{'entity_group': 'B_PERSON_NAME', 'score': 0.99999905, 'word': 'Mark', 'start': 234, 'end': 238}
{'entity_group': 'I_PERSON_NAME', 'score': 0.9999993, 'word': 'Sample', 'start': 239, 'end': 245}
{'entity_group': 'L_DATE_OF_BIRTH', 'score': 0.99999976, 'word': '01/12/1967', 'start': 301, 'end': 311}

So, given the offset values entity_group, word, start, end how can I highlight the original text with the entity_group so that it is easy to visulize.

Final Output

Is there any python-library that I can use to do it.?

Motivation

Makes easy to visualise the NER output

Your contribution

NA

rmckee · October 20, 2022, 2:46am

This looks like what you want. GitHub - natasha/ipymarkup: NER, syntax markup visualizations

Topic		Replies	Views
NER model fine tuning with labeled spans Beginners	5	3906	May 7, 2023
Tokenization in a NER context 🤗Tokenizers	5	5704	August 11, 2021
Token Classification Label order Intermediate	0	566	November 11, 2022
[HELP] NER task single sentence/sample prediction 🤗Transformers	2	1394	August 25, 2021
Application of a transformer model without fine tuning for NER task Beginners	2	1327	May 31, 2021

Feature to highlight or color code the text from the NER output of token classification having offsets using python

Feature request

Motivation

Your contribution

Related topics