Pre-training & fine-tuning BERT on specific domain with custom dataset

rainman020 · June 9, 2021, 1:43pm

Hello,

I need some help with training BERT and thought I could ask here…

I am trying to train a BERT model for a specific domain, similar to BioBERT, but for some other field.
To do this, I ran the run_mlm.py script, which I found at transformers/examples/pytorch/language-modeling at master · huggingface/transformers · GitHub, on bert-base-uncased with a custom dataset: a large .txt file containing my corpus.
As the next step, I wanted to fine-tune my model on NER using the run_ner.py script from the same GitHub repository, under examples/pytorch/token-classification.
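For reference, the two-step workflow described above could be sketched roughly as follows. This is only an illustration under assumptions: all file and directory names (domain_corpus.txt, domain-bert, ner_train.json, and so on) are hypothetical placeholders, and only a minimal set of script arguments is shown. The snippet just builds and prints the two command lines; it does not run any training.

```python
import shlex

# Hypothetical paths (assumptions for illustration only).
CORPUS_FILE = "domain_corpus.txt"      # large plain-text domain corpus
MLM_OUTPUT_DIR = "domain-bert"         # where run_mlm.py saves the adapted model
NER_TRAIN_FILE = "ner_train.json"      # labelled NER training data
NER_DEV_FILE = "ner_dev.json"          # labelled NER validation data
NER_OUTPUT_DIR = "domain-bert-ner"     # where run_ner.py saves the fine-tuned model


def build_mlm_command():
    """Step 1: continue masked-language-model training from bert-base-uncased."""
    return [
        "python", "run_mlm.py",
        "--model_name_or_path", "bert-base-uncased",
        "--train_file", CORPUS_FILE,
        "--do_train",
        "--output_dir", MLM_OUTPUT_DIR,
    ]


def build_ner_command():
    """Step 2: fine-tune the checkpoint from step 1 on token classification."""
    return [
        "python", "run_ner.py",
        "--model_name_or_path", MLM_OUTPUT_DIR,  # output of step 1, not bert-base-uncased
        "--train_file", NER_TRAIN_FILE,
        "--validation_file", NER_DEV_FILE,
        "--do_train",
        "--do_eval",
        "--output_dir", NER_OUTPUT_DIR,
    ]


# Print the commands so they can be copied into a shell.
print(shlex.join(build_mlm_command()))
print(shlex.join(build_ner_command()))
```

The point of the sketch is that the second script is given the output directory of the first one as its model, so that the tokenizer and the weights saved by run_mlm.py are the ones loaded for NER fine-tuning.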
With a small example dataset the fine-tuning works, but when I use my whole dataset I get the following error: Map method to tokenize raises index error - #10 by rainman020.
Could you tell me whether my approach is correct?
Another point I am struggling with: when using run_ner.py, I get a warning that I should train this model on a downstream task. But I thought that is exactly what I am doing by using this script to fine-tune on NER. Do I need to take some extra steps in the pre-training phase?

I googled a lot, but it is still not 100% clear to me. I would be very glad if you could help me.
Thank you!

Greg1901 · June 14, 2021, 3:01pm

Hi there!

I’m using your thread to ask a related question about the run_ner.py script (maybe you can help with this one, since it is extremely basic)! I’m trying to build an extractive summariser with it.

I’m just getting started, so I am at level 0, and I’m trying to understand how to adapt this script to my own data. I have my train, dev and test datasets tokenized and labelled, in CSV format. But when I run the command on my data, the following appears. (The run_ner.py script is loaded into the environment.)

Any guidance on how to make this work…? Thank you!!

nbroad · June 16, 2021, 8:28pm

You probably just need to install from source before running the script.

!pip install git+https://github.com/huggingface/transformers.git

rainman020 · July 4, 2021, 2:42pm

Hi again,

I trained my model and fine-tuned it on a custom dataset for NER, as described in my first post.
But my results are poor: F1 for bert-base-uncased is 0.619, while my own model reaches only F1 = 0.0667 on the same task. How is it possible that my model is so much worse than base BERT? Do you have any ideas?

Thanks and Regards

IreneCrepax · August 10, 2021, 2:56pm

Hi Greg, sorry to intrude on your question. I am new to Hugging Face and struggling to use run_ner.py. I don’t think I understand the format of the CSV train and validation files. What are they supposed to look like?
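On the file format: as far as I can tell from the example script, run_ner.py accepts CSV or JSON files via --train_file and --validation_file, and by default it looks for a column named tokens (a list of words) and a column named ner_tags (one label per word). Lists are awkward to store in CSV, so a JSON Lines file with one sentence per line is often easier. Below is a minimal sketch of writing such a file. The sentences, label names and file name are made up for illustration, and if your columns are named differently you would point the script at them with --text_column_name and --label_column_name.

```python
import json
import os
import tempfile

# Made-up example sentences; each row has one label per token.
examples = [
    {"tokens": ["Aspirin", "inhibits", "COX-1", "."],
     "ner_tags": ["B-DRUG", "O", "B-PROTEIN", "O"]},
    {"tokens": ["No", "entities", "here", "."],
     "ner_tags": ["O", "O", "O", "O"]},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, checking token/label alignment first."""
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            if len(row["tokens"]) != len(row["ner_tags"]):
                raise ValueError(f"row {i}: tokens and ner_tags differ in length")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts (useful as a sanity check)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical output location: a temporary directory instead of a real data folder.
train_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_jsonl(train_path, examples)
print(open(train_path, encoding="utf-8").read())
```

The length check is there because a mismatch between the number of tokens and the number of labels in a row is an easy mistake to make when preparing the data by hand.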
