I need some help with training BERT and thought maybe I could ask here…
I am trying to train a BERT model for a specific domain, similar to BioBERT, but for some other field.
To do this, I ran the run_mlm.py script from transformers/examples/pytorch/language-modeling (master branch of huggingface/transformers on GitHub), starting from bert-base-uncased, on a custom dataset: a large .txt file containing my corpus.
In the next step, I wanted to fine-tune my model on an NER task using the run_ner.py script from the same repository, in examples/pytorch/token-classification.
For a small example dataset the fine-tuning works, but with my whole dataset I get the error described in "Map method to tokenize raises index error" (post #10 by rainman020).
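One thing that might be worth checking in a situation like this is whether every sentence in the full dataset has exactly one label per token, since a row with fewer labels than tokens is one plausible (but unconfirmed) way to hit an index error while labels are being aligned to tokens. The sketch below is only a diagnostic under assumptions: it expects JSON lines input with `tokens` and `ner_tags` keys, and `find_misaligned_rows` is a hypothetical helper, not part of run_ner.py.

```python
import json


def find_misaligned_rows(lines, text_key="tokens", label_key="ner_tags"):
    """Return (line_number, reason) pairs for rows that look malformed.

    Assumption: each non-blank line is a JSON object holding a list of
    tokens and a list of labels under the given keys.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, but still counted
        row = json.loads(line)
        tokens = row.get(text_key)
        labels = row.get(label_key)
        if not isinstance(tokens, list) or not isinstance(labels, list):
            problems.append((number, "tokens or labels is not a list"))
        elif len(tokens) == 0:
            problems.append((number, "empty sentence"))
        elif len(tokens) != len(labels):
            problems.append(
                (number, f"{len(tokens)} tokens but {len(labels)} labels")
            )
    return problems


# Made-up sample rows; the second one has fewer labels than tokens.
sample = [
    '{"tokens": ["Aspirin", "helps"], "ner_tags": ["B-DRUG", "O"]}',
    '{"tokens": ["Too", "few", "labels"], "ner_tags": ["O"]}',
]
print(find_misaligned_rows(sample))
```

To run it on a real file, the open file object can be passed directly, for example `find_misaligned_rows(open("train.json", encoding="utf-8"))`, where `train.json` is a placeholder name.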
Maybe you can tell me if my approach is correct?
Another point I am struggling with: when using run_ner.py, I get the warning that I should train this model on a downstream task. But I thought that is exactly what I am doing by using this script to fine-tune on NER. Do I have to take some extra steps in the pre-training phase?
I have googled a lot, but it is still not 100% clear to me. I would be very glad if you could help me.
Thank you!
I’m using your question to ask a related one about the run_ner.py script (maybe you could help with this one, since it is extremely basic)! I’m trying to build an extractive summariser using that script.
I’m just starting out with all this, so I am at level 0, and I am trying to understand how to adapt this script to my own data. I have my train, dev and test datasets tokenized and labelled, in CSV format. But when I run the command on my data, the following appears. (The run_ner.py script is loaded into the environment.)
I trained my model and fine-tuned it on a custom dataset for NER, as stated in my first post.
But my results are poor: F1 for bert-base-uncased is 0.619, while my own model on the same task gets F1 = 0.0667. How is it possible that my model is so much worse than the base BERT? Do you have any ideas?
Hi Greg, sorry to intrude on your question. I am new to Hugging Face and struggling to use run_ner.py. I don’t think I understand the format of the CSV train and validation files. What are they supposed to look like?
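As a rough illustration of one layout that seems to work: run_ner.py reads one example per sentence, with a list of tokens and a list of labels of the same length, and JSON lines tends to be easier to get right than CSV because the lists survive as lists. The sketch below assumes a token-per-row CSV with the hypothetical columns `sentence_id`, `token` and `label`, and groups it into JSON lines with `tokens` and `ner_tags` keys. The helper name and the sample data are made up for illustration.

```python
import csv
import io
import json


def csv_rows_to_json_lines(csv_text):
    """Group token-per-row CSV data into one JSON object per sentence.

    Assumption: the CSV has the columns sentence_id, token and label.
    """
    sentences = {}  # sentence_id -> (tokens, labels), in first-seen order
    for row in csv.DictReader(io.StringIO(csv_text)):
        tokens, labels = sentences.setdefault(row["sentence_id"], ([], []))
        tokens.append(row["token"])
        labels.append(row["label"])
    return [
        json.dumps({"tokens": tokens, "ner_tags": labels})
        for tokens, labels in sentences.values()
    ]


# Made-up example data: two sentences, one token per CSV row.
example_csv = """sentence_id,token,label
0,Aspirin,B-DRUG
0,helps,O
1,Paris,B-LOC
"""

json_lines = csv_rows_to_json_lines(example_csv)
for line in json_lines:
    print(line)
```

Each printed line can be written to a file and passed to the script with `--train_file` and `--validation_file`. If the keys are named differently, `--text_column_name` and `--label_column_name` should let the script find them, although it is worth checking this against the version of run_ner.py you have.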