Hi,
I’m trying to use the above-mentioned model for token classification.
Below is my sample text:
00:00:02 Speaker 1: hi john, it’s nice to see you again. how was your weekend? do anything special? 00:00:06 Speaker 2: yep, all good thanks. i was with my sister in derby. We saw, you know, that james bond film. what’s it called? then got a couple of drinks at the pitcher and piano, back in nottingham. 00:00:18 Speaker 1: that’s close to your flat, right? 00:00:25 Speaker 2: yeah, about five minutes away. i live on parliament street, remember? 00:00:39 Speaker 1: of course, i remember. you moved last year after you left your parents’ place. 00:00:39 Speaker 2: yeah, it was my sister’s birthday on sunday, susie, the older one. i told you last time about that new job she got. sainsbury’s, the one by victoria centre.
When using the hosted Inference API, the output is excellent. Here is the JSON it returns:
[
{
"entity_group": "PER",
"score": 0.9778427481651306,
"word": "john",
"start": 23,
"end": 27
},
{
"entity_group": "LOC",
"score": 0.9929279685020447,
"word": "derby",
"start": 166,
"end": 171
},
{
"entity_group": "MISC",
"score": 0.7170370817184448,
"word": "james bond",
"start": 196,
"end": 206
},
{
"entity_group": "LOC",
"score": 0.993842363357544,
"word": "nottingham",
"start": 293,
"end": 303
},
{
"entity_group": "LOC",
"score": 0.9108084440231323,
"word": "parliament street",
"start": 420,
"end": 437
},
{
"entity_group": "PER",
"score": 0.9840036034584045,
"word": "susie",
"start": 613,
"end": 618
},
{
"entity_group": "ORG",
"score": 0.9001737236976624,
"word": "sai",
"start": 684,
"end": 687
},
{
"entity_group": "LOC",
"score": 0.9343950748443604,
"word": "##nsbury's",
"start": 687,
"end": 695
},
{
"entity_group": "LOC",
"score": 0.7310423851013184,
"word": "victoria centre",
"start": 708,
"end": 723
}
]
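(Side note for anyone reading along: the `start`/`end` values appear to be character offsets into the original string, so each entity span can be recovered by slicing. A small check I ran against the first entity above:)

```python
# First line of the sample transcript (same text as in the post).
text = "00:00:02 Speaker 1: hi john, it's nice to see you again."

# First entity from the hosted API's JSON output.
entity = {"entity_group": "PER", "word": "john", "start": 23, "end": 27}

# Slicing the input with start/end recovers the entity surface form.
span = text[entity["start"]:entity["end"]]
print(span)
```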
But when I run the model locally via the Python API with the following code:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER-uncased")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER-uncased")
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)
example = """00:00:02 Speaker 1: hi john, it's nice to see you again. how was your weekend? do anything special? 00:00:06 Speaker 2: yep, all good thanks. i was with my sister in derby. We saw, you know, that james bond film. what's it called? then got a couple of drinks at the pitcher and piano, back in nottingham. 00:00:18 Speaker 1: that's close to your flat, right? 00:00:25 Speaker 2: yeah, about five minutes away. i live on parliament street, remember? 00:00:39 Speaker 1: of course, i remember. you moved last year after you left your parents' place. 00:00:39 Speaker 2: yeah, it was my sister's birthday on sunday, susie, the older one. i told you last time about that new job she got. sainsbury's, the one by victoria centre."""
ner_results = nlp(example)
print(ner_results)
print(len(ner_results))
I get very different results. Here is the output:
[{'entity': 'B-PER', 'score': 0.97784275, 'index': 10, 'word': 'john', 'start': 23, 'end': 27}, {'entity': 'B-LOC', 'score': 0.99292797, 'index': 50, 'word': 'derby', 'start': 166, 'end': 171}, {'entity': 'B-MISC', 'score': 0.8592305, 'index': 59, 'word': 'james', 'start': 196, 'end': 201}, {'entity': 'I-MISC', 'score': 0.5748464, 'index': 60, 'word': 'bond', 'start': 202, 'end': 206}, {'entity': 'B-LOC', 'score': 0.9938424, 'index': 83, 'word': 'nottingham', 'start': 293, 'end': 303}, {'entity': 'B-LOC', 'score': 0.8480199, 'index': 121, 'word': 'parliament', 'start': 420, 'end': 430}, {'entity': 'I-LOC', 'score': 0.973597, 'index': 122, 'word': 'street', 'start': 431, 'end': 437}, {'entity': 'B-PER', 'score': 0.9840036, 'index': 172, 'word': 'susie', 'start': 613, 'end': 618}, {'entity': 'B-ORG', 'score': 0.90017325, 'index': 190, 'word': 'sai', 'start': 684, 'end': 687}, {'entity': 'I-LOC', 'score': 0.93890965, 'index': 191, 'word': '##ns', 'start': 687, 'end': 689}, {'entity': 'I-LOC', 'score': 0.8916274, 'index': 192, 'word': '##bury', 'start': 689, 'end': 693}, {'entity': 'I-LOC', 'score': 0.9475074, 'index': 193, 'word': "'", 'start': 693, 'end': 694}, {'entity': 'I-LOC', 'score': 0.9595369, 'index': 194, 'word': 's', 'start': 694, 'end': 695}, {'entity': 'B-LOC', 'score': 0.55478203, 'index': 199, 'word': 'victoria', 'start': 708, 'end': 716}, {'entity': 'I-LOC', 'score': 0.90730333, 'index': 200, 'word': 'centre', 'start': 717, 'end': 723}]
As can be seen, it detects 15 entities, far more than the hosted API returns. It even tags 's as an I-LOC, which is clearly wrong and makes the result unusable.
Why the difference in results? Am I doing something wrong in the code?
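For what it's worth, I wonder whether the hosted API is applying the pipeline's `aggregation_strategy` option to merge sub-word tokens into whole entities. Its output has `entity_group` keys rather than per-token `entity` keys, which the docs associate with aggregation. A sketch of what I mean; I haven't confirmed this is exactly what the hosted API does:

```python
from transformers import pipeline

# Assumption: the hosted API merges sub-word pieces ("sai" + "##ns" + ...)
# into whole entities. aggregation_strategy="simple" asks the local
# pipeline to do the same grouping, producing entity_group/word spans
# instead of one result per token.
nlp = pipeline(
    "token-classification",
    model="dslim/bert-base-NER-uncased",
    aggregation_strategy="simple",
)

for ent in nlp("i was with my sister in derby. we saw that james bond film."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
```

If this is right, the per-token output I pasted above would simply be the un-aggregated view of the same predictions, and the remaining score differences would come from how the strategy averages scores across grouped tokens.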
Thanks