Pre-Train BERT from scratch

I want to pre-train BERT from scratch on a domain-specific dataset. How should I go about it? I tried some code online but ran into issues, and since I have never used PyTorch or TensorFlow before, I can't understand most of the code I find.

You’ll need two things:

  1. How to train such a model. If you have a theoretical background in ML and DL, you can read the paper that introduced BERT. There you'll find the information regarding the training regime, the masking probability, the tokens they trained on, the scheduler, and so on.

  2. The second is how to code it. To that end, Hugging Face provides a script that can do it for you.
    The script you need is run_mlm.py from this repo.
    The repo shows a simple example of how to run the script, but you can pass many more parameters for your needs, including training a given model (for example BERT) from scratch.
    The script uses the Trainer class along with the TrainingArguments class.
    I recommend you first understand from the paper how you should train your BERT, then apply that with the script. Most, if not all, of your needs are already covered by the script's optional arguments.
    Good luck!
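For reference, a minimal from-scratch invocation might look like the sketch below. The file paths and hyperparameter values are placeholders for your own setup, not recommendations; the flags themselves (--model_type, --tokenizer_name, --train_file, --max_seq_length, --mlm_probability, etc.) are among the script's options:

```shell
# Sketch: pre-train a BERT-style MLM from scratch with run_mlm.py.
# All paths and hyperparameter values below are placeholders.
python run_mlm.py \
  --model_type bert \
  --tokenizer_name bert-base-uncased \
  --train_file data/train.txt \
  --validation_file data/valid.txt \
  --do_train --do_eval \
  --max_seq_length 512 \
  --mlm_probability 0.15 \
  --per_device_train_batch_size 16 \
  --num_train_epochs 3 \
  --output_dir ./bert-from-scratch
```

Passing --model_type (instead of --model_name_or_path) is what tells the script to initialize a fresh model rather than fine-tune an existing checkpoint.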


Thank you for your suggestion. I see this is for pre-training on the MLM task. Could you tell me how to work with the NSP task? I am having issues with it. This is the error I get:
RuntimeError: The size of tensor a (763) must match the size of tensor b (512) at non-singleton dimension 1

I also tried limiting the max length by passing an argument, but it did not work.

NSP is a task that was introduced in the original paper to capture sentence-level relationships, but later studies, especially RoBERTa, showed that it is not only unnecessary but can also hurt model quality. So it might not be a good idea to start with it.
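As for the error itself: BERT's position embeddings only cover 512 positions, so a batch containing a 763-token sequence cannot be added to them; the usual fix is to truncate (or chunk) inputs to the model's maximum length before batching, which is what --max_seq_length does in run_mlm.py. A minimal pure-Python sketch of the idea (truncate is a hypothetical helper, not the script's actual code):

```python
# BERT-base has 512 position-embedding slots, so any token sequence
# longer than that must be truncated (or split into chunks) before
# it reaches the model -- otherwise you get the size-mismatch error.
MAX_LEN = 512  # bert-base max position embeddings

def truncate(token_ids, max_len=MAX_LEN):
    """Keep at most max_len tokens (hypothetical helper)."""
    return token_ids[:max_len]

ids = list(range(763))            # a 763-token sequence, as in the error
print(len(truncate(ids)))         # 512: now compatible with the model
```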

I have been working on pre-training a BERT-type model from scratch, and I found James Briggs' tutorial a good starting point. Also, MosaicML's claim of pre-training a BERT model for under $50 is something to consider.


Thank you for the suggestion and explanation. Could you please tell me how to train a pre-trained BERT model on my dataset, which is a combination of DNA sequences and methylation information (tokens: [A, T, C, G, M, UM]), using k-mers to create the sequences? Can I use run_mlm.py?

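For concreteness, this is roughly the k-mer preprocessing I mean: each sequence is split into overlapping k-mers, and each k-mer becomes one whitespace-separated "word" that a tokenizer can learn (kmerize is a hypothetical helper of mine, not part of run_mlm.py):

```python
# Split a DNA-style sequence into overlapping k-mers so each k-mer
# becomes a whitespace-separated token for MLM pre-training.
def kmerize(seq, k=3):
    """Return the space-separated overlapping k-mers of seq."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmerize("ATCGMA"))  # -> ATC TCG CGM GMA
```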