Hello,
I’m trying to implement NER with BioBERT.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
sentence = "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals."
result = nlp(sentence)
print(result)
But the result isn’t what I’m expecting.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'word': 'This', 'score': 0.5616263747215271, 'entity': 'LABEL_1', 'index': 1, 'start': 0, 'end': 4}, {'word': 'expression', 'score': 0.6285454630851746, 'entity': 'LABEL_1', 'index': 2,
The output is pretty clear: I need to train the model.
But I'm not sure whether, even with a trained model, I will manage to get rid of the 'entity': 'LABEL_1' issue.
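As far as I understand, LABEL_0 / LABEL_1 are placeholder names that appear when the model config carries no real tag names in its id2label mapping, so a fine-tuned checkpoint with proper labels should not show them. A minimal sketch of what that renaming amounts to, in plain Python. The tag names in ID2LABEL and the helper rename_entities are hypothetical, chosen only for illustration, and with an untrained classifier head the predictions themselves stay meaningless whatever they are called:

```python
# Hypothetical tag set; a real fine-tuned checkpoint defines its own names.
ID2LABEL = {0: "O", 1: "B-gene", 2: "I-gene"}


def rename_entities(predictions, id2label):
    """Replace placeholder 'LABEL_<n>' names with readable tag names.

    `predictions` is a list of dicts shaped like the pipeline output above.
    Entries whose 'entity' is not a placeholder are left unchanged.
    """
    renamed = []
    for pred in predictions:
        entity = pred["entity"]
        if entity.startswith("LABEL_"):
            label_id = int(entity.split("_", 1)[1])
            entity = id2label[label_id]
        # Copy the dict so the original predictions are not mutated.
        renamed.append({**pred, "entity": entity})
    return renamed


sample = [{"word": "This", "entity": "LABEL_1", "index": 1, "start": 0, "end": 4}]
print(rename_entities(sample, ID2LABEL))
```

This only changes the names. Getting correct tags still requires fine-tuning on a labelled NER dataset.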
My desired output would be something like what BERN produces:
https://bern.korea.ac.kr/
A complete response looks like this:
{
"project": "BERN",
"sourcedb": "",
"sourceid": "43c1bfdebd3ccb8c9a42d10a22a3be3e8b2fe9ae7601b244b6318d71-Thread-18603546",
"text": "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals.",
"denotations": [
{
"id": [
"HGNC:8020",
"BERN:324182202"
],
"span": {
"begin": 19,
"end": 23
},
"obj": "gene"
},
{
"id": [
"MIM:602460",
"HGNC:9220",
"Ensembl:ENSG00000091010",
"BERN:324351702"
],
"span": {
"begin": 89,
"end": 94
},
"obj": "gene"
}
],
"timestamp": "Thu May 27 08:22:14 +0000 2021",
"logits": {
"disease": [],
"gene": [
[
{
"start": 19,
"end": 23,
"id": "HGNC:8020\tBERN:324182202"
},
0.9999972581863403
],
[
{
"start": 89,
"end": 94,
"id": "MIM:602460\tHGNC:9220\tEnsembl:ENSG00000091010\tBERN:324351702"
},
0.9999972581863403
]
],
"drug": [],
"species": []
}
}
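To make the gap between the two formats concrete, here is a rough sketch of how token-level predictions could be folded into span entries shaped like the "denotations" above. It assumes a fine-tuned model that emits BIO tags such as B-gene / I-gene (hypothetical names) along with the 'start' and 'end' character offsets the pipeline already returns. The helper merge_bio_spans is made up for illustration, and the "id" fields are not covered because they look like the result of a separate normalization step:

```python
def merge_bio_spans(tokens):
    """Group consecutive B-/I- tagged tokens into character-level spans.

    Each token is a dict with 'entity' (a BIO tag such as 'B-gene', 'I-gene'
    or 'O'), 'start' and 'end'. Returns a list of dicts with 'span' and 'obj'.
    """
    denotations = []
    current = None
    for tok in tokens:
        prefix, _, kind = tok["entity"].partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["obj"] != kind)
        )
        if starts_new:
            # Close any open span, then open a new one at this token.
            if current is not None:
                denotations.append(current)
            current = {"span": {"begin": tok["start"], "end": tok["end"]}, "obj": kind}
        elif prefix == "I":
            # Continuation of the open span: extend its end offset.
            current["span"]["end"] = tok["end"]
        else:
            # 'O' (or anything unexpected) closes the open span.
            if current is not None:
                denotations.append(current)
                current = None
    if current is not None:
        denotations.append(current)
    return denotations


# Hypothetical predictions for "... of NT-3 in ..." split into word pieces.
tokens = [
    {"word": "of", "entity": "O", "start": 16, "end": 18},
    {"word": "NT", "entity": "B-gene", "start": 19, "end": 21},
    {"word": "-", "entity": "I-gene", "start": 21, "end": 22},
    {"word": "3", "entity": "I-gene", "start": 22, "end": 23},
    {"word": "in", "entity": "O", "start": 24, "end": 26},
]
print(merge_bio_spans(tokens))
# [{'span': {'begin': 19, 'end': 23}, 'obj': 'gene'}]
```

The resulting span (19 to 23, "gene") matches the first denotation in the response above, which suggests the span part of that output is reachable from token classification alone.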
Am I on the right path to achieve that?
Any help/suggestion is more than welcome!
Cheers,
Vivian