Pre-training & fine-tuning BERT on a specific domain with a custom dataset

Hello,

I need some help with training BERT and thought I might ask here…

I am trying to train a BERT model for a specific domain, similar to BioBERT, but for a different field.
To do this, I ran the run_mlm.py script from examples/pytorch/language-modeling in the huggingface/transformers GitHub repo on bert-base-uncased, with a custom dataset: a large .txt file containing my corpus.
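For context, this is roughly the command I am running (paths and hyperparameters here are placeholders, not my exact setup):

python run_mlm.py \
  --model_name_or_path bert-base-uncased \
  --train_file path/to/corpus.txt \
  --line_by_line \
  --max_seq_length 512 \
  --per_device_train_batch_size 8 \
  --do_train \
  --output_dir ./domain-bert-mlm
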
In the next step I wanted to fine-tune my model on an NER task using the run_ner.py script from the same repo, in examples/pytorch/token-classification.
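Again only as a rough sketch (file names are placeholders), I call it along these lines:

python run_ner.py \
  --model_name_or_path ./domain-bert-mlm \
  --train_file train.json \
  --validation_file dev.json \
  --do_train \
  --do_eval \
  --output_dir ./domain-bert-ner
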
For a small example dataset the fine-tuning works, but if I use my whole dataset I get the error described in this topic: Map method to tokenize raises index error - #10 by rainman020.
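As far as I can tell, the step that fails is the tokenization done via the datasets map method. Simplified and paraphrased from run_ner.py (so details may differ from the real script), that step looks roughly like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; truncation keeps sequences within the model limit
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens have no word id and get -100 so the loss ignores them
        labels.append([word_labels[w] if w is not None else -100 for w in word_ids])
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# The script applies this over the whole dataset in batches:
# tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)
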
Maybe you can tell me if my approach is correct?
Another point where I am struggling: when using run_ner.py I get a warning saying I should train this model on a down-stream task. But I thought that is exactly what I am doing by using this script to fine-tune on NER. Do I have to do some extra steps in the pre-training phase?

I have googled a lot, but it is still not 100% clear to me. Any help would be much appreciated.
Thank you!

Hi there!

I’m using your question to ask a related one about the run_ner.py script (maybe you could help with this one, since it is extremely basic)! I’m trying to build an extractive summariser with it.

I’m just starting out with all this, so I am at level 0, and I am trying to understand how to fit this script to my own data. I have my train, dev, and test datasets tokenized and labelled, in CSV format. But when I run the command on my data, I get an error (the run_ner.py script is loaded into the environment).

Any guidance on how to make this work…? Thank you!!

The example scripts on master track the development version of the library, so you probably just need to install from source before running the script:

!pip install git+https://github.com/huggingface/transformers.git

Hi again,

I trained my model and fine-tuned it on a custom dataset for NER, as described in my first post.
But my results are poor: F1 for bert-base-uncased is 0.619, while my own model on the same task gets F1 = 0.0667. How can my model be so much worse than base BERT? Do you have any ideas?

Thanks and Regards

Hi Greg, sorry to intrude on your question. I am new to Hugging Face and struggling to use run_ner.py. I don’t think I understand the formatting of the CSV train and validation files. What are they supposed to look like?
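From reading the script, my guess is that each example needs a tokens column and a ner_tags column; in the JSON-lines format that would be something like the line below, but I’m not sure how that is supposed to map onto CSV:

{"tokens": ["John", "lives", "in", "Berlin"], "ner_tags": ["B-PER", "O", "O", "B-LOC"]}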