BioBERT NER issue


I’m trying to implement NER with :hugs: Transformers and BioBERT.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
sentence = "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals."

result = nlp(sentence)

But the result isn’t what I’m expecting.

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'word': 'This', 'score': 0.5616263747215271, 'entity': 'LABEL_1', 'index': 1, 'start': 0, 'end': 4}, {'word': 'expression', 'score': 0.6285454630851746, 'entity': 'LABEL_1', 'index': 2,

The output is pretty clear: I need to train the model.
But I’m not sure whether, with a trained model, I will manage to get rid of the ‘entity’: ‘LABEL_1’ issue.
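For what it’s worth, the generic LABEL_0/LABEL_1 names appear because the base checkpoint’s config carries no `id2label` mapping; once you fine-tune with a label map attached to the config, the pipeline reports the real tag names. A minimal sketch (the IOB disease labels here are an assumption for illustration; the real map comes from whatever dataset the model is fine-tuned on):

```python
from transformers import BertConfig

# Hypothetical IOB label map for disease tagging; replace with the
# label set of your fine-tuning dataset.
id2label = {0: "O", 1: "B-Disease", 2: "I-Disease"}
label2id = {label: i for i, label in id2label.items()}

# Attaching the map to the config makes the pipeline emit "B-Disease"
# instead of "LABEL_1" once the classification head is trained.
config = BertConfig(
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
```

In practice you would pass the same `id2label`/`label2id` arguments to `AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1", ...)` before training.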

My desired output would be something like the complete BERN response:

{
    "project": "BERN",
    "sourcedb": "",
    "sourceid": "43c1bfdebd3ccb8c9a42d10a22a3be3e8b2fe9ae7601b244b6318d71-Thread-18603546",
    "text": "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals.",
    "denotations": [
        {
            "id": [...],
            "span": {
                "begin": 19,
                "end": 23
            },
            "obj": "gene"
        },
        {
            "id": [...],
            "span": {
                "begin": 89,
                "end": 94
            },
            "obj": "gene"
        }
    ],
    "timestamp": "Thu May 27 08:22:14 +0000 2021",
    "logits": {
        "disease": [],
        "gene": [
            {
                "start": 19,
                "end": 23,
                "id": "HGNC:8020\tBERN:324182202"
            },
            {
                "start": 89,
                "end": 94,
                "id": "MIM:602460\tHGNC:9220\tEnsembl:ENSG00000091010\tBERN:324351702"
            }
        ],
        "drug": [],
        "species": []
    }
}
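Once a fine-tuned model returns real entity labels, the pipeline output (with `aggregation_strategy="simple"`, each merged entity carries `start`/`end` character offsets and an `entity_group` name) could be mapped into this denotation shape. A rough sketch with a hypothetical helper (`to_denotations` is not part of any library):

```python
def to_denotations(entities):
    """Map HF token-classification pipeline entities (dicts with
    'start', 'end', 'entity_group') into BERN-style denotations."""
    return [
        {
            "span": {"begin": e["start"], "end": e["end"]},
            "obj": e["entity_group"].lower(),
        }
        for e in entities
    ]

# Hand-written example matching the NT-3 span above:
entities = [{"start": 19, "end": 23, "entity_group": "Gene", "score": 0.99}]
print(to_denotations(entities))
# [{'span': {'begin': 19, 'end': 23}, 'obj': 'gene'}]
```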

Am I on the right path to achieve that?
Any help/suggestion is more than welcome!


@Vivian Did you get any success with this issue? I am also in a similar situation.


In order to get something reliable, we deployed this:

Let’s see in the future how we could do differently :slight_smile:


Hi @Vivian, in your original issue, it prompts you to fine-tune the model on a downstream task. If I’m not wrong, we would need labelled data for that. I’m specifically interested in tagging diseases in PubMed files, and I’m not sure how I would be able to fine-tune BioBERT for this task. Do you have any idea?
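You do need labelled data for fine-tuning; for disease tagging specifically, public corpora such as NCBI-disease (disease mentions annotated in PubMed abstracts) are commonly used. One fiddly step is aligning word-level labels to subword tokens, since fast tokenizers split words into pieces. A sketch, assuming you already have the `word_ids()` list that fast tokenizers expose (this `align_labels` helper is hypothetical, not a library function):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    # word_ids: one entry per subword token; None for special tokens
    # like [CLS]/[SEP]. Only the first subword of each word keeps its
    # label; the rest get ignore_index so the loss skips them.
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# Two words, the first split into three subwords ("NT", "-", "3"):
print(align_labels([None, 0, 0, 0, 1, None], [1, 0]))
# [-100, 1, -100, -100, 0, -100]
```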

Also, regarding BERN (and BERN2), is there a Hugging Face implementation available? I checked the link you attached, and apparently ~70 GB of disk space would be required to use BERN for NER. I’m willing to do these things in Google Colab. Any idea how I should go about it? Or do you have any experience with NER on biomedical text data?

Any help is highly appreciated :slight_smile:



I’m still using BERN with some good results.
I did not find a BERN model with HuggingFace.
Perhaps in the future?

I will take a look at BERN2.
Good luck!

Okay, how are you using it? I mean, is it the web API, or did you get the code running on your own system or in the cloud?

I want to use it for multiple files, so the web API isn’t the way for me. Could you please share how I can get it running on my system?


We downloaded the BERN project in order to run it privately.
Definitely, the web API is not tailored for this kind of purpose.

Unfortunately, I can’t share anything other than the useful BERN readme.

Perhaps you should ask a WebDev/DevOps person to deploy this solution.


Hello @srishti-hf1110 and @Vivian! Thank you for posting this discussion. I am also working with BioBERT/BERN, and what you’ve written here has helped me find what I need to get started. I am also using multiple documents (YouTube comments) and created some functions to split each document into 5000 characters or less and then run them through the BERN API.
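A splitter along those lines could look like this (a hypothetical sketch, not the actual functions from the package below; it prefers breaking at the last sentence or line boundary before the limit so entity spans aren’t cut mid-word):

```python
def split_document(text, max_len=5000):
    """Split text into chunks of at most max_len characters, breaking
    at the last ". " or newline before the limit when possible."""
    chunks = []
    while len(text) > max_len:
        cut = max(text.rfind(". ", 0, max_len), text.rfind("\n", 0, max_len))
        if cut <= 0:
            cut = max_len  # no natural break found; hard cut
        else:
            cut += 1  # keep the period/newline with the leading chunk
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent to the BERN endpoint separately, with the character offsets in the results shifted back by the chunk’s starting position in the original document.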

Feel free to use anything from my project by installing the functions with: pip install biobert-bern

or feel free to use anything from the git repo.

I’ve also posted a more readable website version created with nbdev and Quarto. Please let me know if this is useful to anyone!