BioBERT NER issue


I’m trying to implement NER with :hugs: Transformers and BioBERT.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
sentence = "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals."

result = nlp(sentence)

But the result isn’t what I’m expecting.

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'word': 'This', 'score': 0.5616263747215271, 'entity': 'LABEL_1', 'index': 1, 'start': 0, 'end': 4}, {'word': 'expression', 'score': 0.6285454630851746, 'entity': 'LABEL_1', 'index': 2,

The output is pretty clear: I need to train the model.
But I’m not sure whether, with a trained model, I will manage to get rid of the ‘entity’: ‘LABEL_1’ issue.
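For what it’s worth, the generic LABEL_0/LABEL_1 names appear because the base checkpoint’s config carries no `id2label` mapping; once you fine-tune with a label map attached to the config, the pipeline reports the real tag names. A minimal sketch (the IOB disease labels here are an assumption for illustration; the real map comes from whatever dataset the model is fine-tuned on):

```python
from transformers import BertConfig

# Hypothetical IOB label map for disease tagging; replace with the
# label set of your fine-tuning dataset.
id2label = {0: "O", 1: "B-Disease", 2: "I-Disease"}
label2id = {label: i for i, label in id2label.items()}

# Attaching the map to the config makes the pipeline emit "B-Disease"
# instead of "LABEL_1" once the classification head is trained.
config = BertConfig(
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
```

In practice you would pass the same `id2label`/`label2id` arguments to `AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1", ...)` before training.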

My desired output would be something like the complete BERN response:

{
    "project": "BERN",
    "sourcedb": "",
    "sourceid": "43c1bfdebd3ccb8c9a42d10a22a3be3e8b2fe9ae7601b244b6318d71-Thread-18603546",
    "text": "This expression of NT-3 in supporting cells in embryos and neonates may even preserve in Brn3c null mutants the numerous spiral sensory neurons in the apex of 8-day old animals.",
    "denotations": [
        {
            "id": [...],
            "span": {
                "begin": 19,
                "end": 23
            },
            "obj": "gene"
        },
        {
            "id": [...],
            "span": {
                "begin": 89,
                "end": 94
            },
            "obj": "gene"
        }
    ],
    "timestamp": "Thu May 27 08:22:14 +0000 2021",
    "logits": {
        "disease": [],
        "gene": [
            {
                "start": 19,
                "end": 23,
                "id": "HGNC:8020\tBERN:324182202"
            },
            {
                "start": 89,
                "end": 94,
                "id": "MIM:602460\tHGNC:9220\tEnsembl:ENSG00000091010\tBERN:324351702"
            }
        ],
        "drug": [],
        "species": []
    }
}
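Once a fine-tuned model returns real entity labels, the pipeline output (with `aggregation_strategy="simple"`, each merged entity carries `start`/`end` character offsets and an `entity_group` name) could be mapped into this denotation shape. A rough sketch with a hypothetical helper (`to_denotations` is not part of any library):

```python
def to_denotations(entities):
    """Map HF token-classification pipeline entities (dicts with
    'start', 'end', 'entity_group') into BERN-style denotations."""
    return [
        {
            "span": {"begin": e["start"], "end": e["end"]},
            "obj": e["entity_group"].lower(),
        }
        for e in entities
    ]

# Hand-written example matching the NT-3 span above:
entities = [{"start": 19, "end": 23, "entity_group": "Gene", "score": 0.99}]
print(to_denotations(entities))
# [{'span': {'begin': 19, 'end': 23}, 'obj': 'gene'}]
```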

Am I on the right path to achieve that?
Any help/suggestion is more than welcome!


@Vivian Did you get any success with this issue? I am also in a similar situation.


In order to get something reliable, we deployed this:

Let’s see in the future how we could do differently :slight_smile:


Hi @Vivian, in your original issue, it prompts you to fine-tune the model on a downstream task. If I’m not wrong, we would need labelled data for that. I’m specifically interested in tagging diseases in PubMed files, and I’m not sure how I would be able to fine-tune BioBERT for this task. Do you have any idea?
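You do need labelled data for fine-tuning; for disease tagging specifically, public corpora such as NCBI-disease (disease mentions annotated in PubMed abstracts) are commonly used. One fiddly step is aligning word-level labels to subword tokens, since fast tokenizers split words into pieces. A sketch, assuming you already have the `word_ids()` list that fast tokenizers expose (this `align_labels` helper is hypothetical, not a library function):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    # word_ids: one entry per subword token; None for special tokens
    # like [CLS]/[SEP]. Only the first subword of each word keeps its
    # label; the rest get ignore_index so the loss skips them.
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# Two words, the first split into three subwords ("NT", "-", "3"):
print(align_labels([None, 0, 0, 0, 1, None], [1, 0]))
# [-100, 1, -100, -100, 0, -100]
```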

Also, regarding BERN (and BERN2), is there a Hugging Face implementation available? I checked the link you attached, and apparently ~70 GB of disk space would be required to use BERN for NER. I’m willing to do these things in Google Colab. Any idea how I should go about it? Or do you have any experience with NER on biomedical text data?

Any help is highly appreciated :slight_smile:



I’m still using BERN with some good results.
I did not find a BERN model with HuggingFace.
Perhaps in the future?

I will take a look at BERN2.
Good luck!

Okay, how are you using it? I mean, is it the web API, or did you get the code running on your own system or in the cloud?

I want to use it for multiple files, so the web API isn’t the way for me. Could you please share how I can get it running on my system?


We downloaded the BERN project in order to run it privately.
Definitely, the web API is not tailored for this kind of purpose.

Unfortunately, I can’t share anything other than the useful BERN readme.

Perhaps you should ask a WebDev/DevOps person to deploy this solution.


Hello @srishti-hf1110 and @Vivian! Thank you for posting this discussion. I am also working with BioBERT/BERN, and what you’ve written here has helped me find what I need to get started. I am also using multiple documents (YouTube comments) and created some functions to split each document into 5000 characters or less and then run them through the BERN API.
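A splitter along those lines could look like this (a hypothetical sketch, not the actual functions from the package below; it prefers breaking at the last sentence or line boundary before the limit so entity spans aren’t cut mid-word):

```python
def split_document(text, max_len=5000):
    """Split text into chunks of at most max_len characters, breaking
    at the last ". " or newline before the limit when possible."""
    chunks = []
    while len(text) > max_len:
        cut = max(text.rfind(". ", 0, max_len), text.rfind("\n", 0, max_len))
        if cut <= 0:
            cut = max_len  # no natural break found; hard cut
        else:
            cut += 1  # keep the period/newline with the leading chunk
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be sent to the BERN endpoint separately, with the character offsets in the results shifted back by the chunk’s starting position in the original document.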

Feel free to use anything from my project by installing the functions with: pip install biobert-bern

or feel free to use anything from the git repo.

I’ve also posted a more readable website version created with nbdev and Quarto. Please let me know if this is useful to anyone!