Pre-Train BERT from scratch

I want to pre-train BERT from scratch on a domain-specific dataset. How should I go about it? I tried some code online but ran into issues, and since I have never used PyTorch or TensorFlow before, I can't understand most of the code I find.

You’ll need two things:

  1. How to train such a model. If you have a theoretical background in ML and DL, you can read the paper that introduced BERT. There you'll find the information regarding the training regime, the masking probability, the tokens they trained on, the scheduler, and so on.

  2. The second is how to code it. To that end, Hugging Face provides a script that can do it for you.
    The script you need is run_mlm.py from this repo.
    The repo shows a simple example of how to run the script, but you can pass many more parameters for your needs, including training a given model (for example BERT) from scratch.
    The script uses the Trainer class along with the TrainingArguments class.
    I recommend you first understand from the paper how you should train your BERT, then apply that with the script. Most, if not all, of your needs are already covered by the script's optional arguments.
    Good luck!
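For reference, a minimal from-scratch invocation might look like the sketch below. The file paths and hyperparameter values are placeholders for your own setup, not recommendations; the flags themselves (--model_type, --tokenizer_name, --train_file, --max_seq_length, --mlm_probability, etc.) are among the script's options:

```shell
# Sketch: pre-train a BERT-style MLM from scratch with run_mlm.py.
# All paths and hyperparameter values below are placeholders.
python run_mlm.py \
  --model_type bert \
  --tokenizer_name bert-base-uncased \
  --train_file data/train.txt \
  --validation_file data/valid.txt \
  --do_train --do_eval \
  --max_seq_length 512 \
  --mlm_probability 0.15 \
  --per_device_train_batch_size 16 \
  --num_train_epochs 3 \
  --output_dir ./bert-from-scratch
```

Passing --model_type (instead of --model_name_or_path) is what tells the script to initialize a fresh model rather than fine-tune an existing checkpoint.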


Thank you for your suggestion. I see this is for pre-training on the MLM task. Could you tell me how to work with the NSP task? I am having issues with it. This is the error I get:
RuntimeError: The size of tensor a (763) must match the size of tensor b (512) at non-singleton dimension 1

I also tried limiting the max length by passing an argument, but it did not work.

NSP is a task that was introduced in the original paper to capture sentence-level relationships, but later studies, especially RoBERTa, showed that it is not only unnecessary but can also hurt model quality. So it might not be a good idea to start with it.
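As for the error itself: BERT's position embeddings only cover 512 positions, so a batch containing a 763-token sequence cannot be added to them; the usual fix is to truncate (or chunk) inputs to the model's maximum length before batching, which is what --max_seq_length does in run_mlm.py. A minimal pure-Python sketch of the idea (truncate is a hypothetical helper, not the script's actual code):

```python
# BERT-base has 512 position-embedding slots, so any token sequence
# longer than that must be truncated (or split into chunks) before
# it reaches the model -- otherwise you get the size-mismatch error.
MAX_LEN = 512  # bert-base max position embeddings

def truncate(token_ids, max_len=MAX_LEN):
    """Keep at most max_len tokens (hypothetical helper)."""
    return token_ids[:max_len]

ids = list(range(763))            # a 763-token sequence, as in the error
print(len(truncate(ids)))         # 512: now compatible with the model
```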

I have been working on pre-training a BERT-type model from scratch, and I found James Briggs' tutorial a good starting point. Also, MosaicML's claim of pre-training a BERT model for under $50 is something to consider.


Thank you for the suggestion and explanation. Could you please tell me how to train a pre-trained BERT model on my dataset, which is a combination of DNA sequences and methylation information (tokens: [A, T, C, G, M, UM]), using k-mers to create the sequences? Can I use run_mlm.py?

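For concreteness, this is roughly the k-mer preprocessing I mean: each sequence is split into overlapping k-mers, and each k-mer becomes one whitespace-separated "word" that a tokenizer can learn (kmerize is a hypothetical helper of mine, not part of run_mlm.py):

```python
# Split a DNA-style sequence into overlapping k-mers so each k-mer
# becomes a whitespace-separated token for MLM pre-training.
def kmerize(seq, k=3):
    """Return the space-separated overlapping k-mers of seq."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmerize("ATCGMA"))  # -> ATC TCG CGM GMA
```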