I need some help with training BERT and thought maybe I could ask here…
I am trying to train a BERT model for a specific domain, similar to BioBERT, but for some other field.
To do this, I ran the run_mlm.py script from transformers/examples/pytorch/language-modeling at master · huggingface/transformers · GitHub on bert-base-uncased with a custom dataset, which is a large txt file containing my corpus.
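For reference, my pre-training call looks roughly like this (paths and hyperparameters below are placeholders, not my exact values):

```shell
# Rough sketch of the MLM pre-training invocation; file names, output dir,
# and hyperparameter values are placeholders.
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --train_file my_corpus.txt \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 8 \
    --num_train_epochs 3 \
    --output_dir ./domain-bert
```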
In the next step I wanted to fine-tune my model on a NER task using the run_ner.py script from the same repository, in examples/pytorch/token-classification.
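The fine-tuning call is roughly the following (again, the file names and output directory are placeholders; I point --model_name_or_path at the output of the pre-training step):

```shell
# Rough sketch of the NER fine-tuning invocation; ner_train.json /
# ner_dev.json stand in for my actual token-classification files.
python run_ner.py \
    --model_name_or_path ./domain-bert \
    --train_file ner_train.json \
    --validation_file ner_dev.json \
    --do_train \
    --do_eval \
    --output_dir ./domain-bert-ner
```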
For a small example dataset the fine-tuning works, but when I use my whole dataset I get the following error: Map method to tokenize raises index error - #10 by rainman020.
Maybe you can tell me if my approach is correct?
Another point I am struggling with: when using run_ner.py I get a warning that I should train this model on a down-stream task. But I thought that is exactly what I am doing by using this script to fine-tune on NER. Do I have to take some extra steps in the pre-training phase?
I googled a lot, but it is still not 100% clear to me. If you could help me, I would be very glad.