I want to pre-train BERT from scratch on a domain-specific dataset. How should I go about it? I tried some code I found online but ran into issues. I have never used PyTorch or TensorFlow before, so I can't understand most of the code online.
You’ll need two things:
The first is how to train such a model. If you have a theoretical background in ML and DL, you can read the paper that introduced BERT. It describes the training regime, the masking probability, the tokens they trained on, the scheduler, and so on.
The second is how to code it. For that, Hugging Face provides a script that can do it for you.
The script you need is run_mlm.py from this repo.
The repo includes a simple example of how to run the script, and you can pass many more parameters to suit your needs, including training a given model (for example BERT) from scratch.
The script uses the Trainer class together with the TrainingArguments class.
I recommend working out from the paper how you should train your BERT, then applying that with the script. Most, if not all, of what you need is already covered by the script's optional arguments.
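To make the "masking probability" part of the paper concrete, here is a minimal pure-Python sketch of the masking rule it describes: about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. The function name mask_tokens and its parameters are made up for illustration, and the ids used in the demo (101, 102, 103, vocabulary size 30522) are the bert-base-uncased values, used here only as an assumption. This is not the script's own code, which handles this step for you.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) using the 80/10/10 masking rule from the BERT paper.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 marks positions that should not contribute to the loss
    # (it is the default ignore index of PyTorch's cross-entropy loss).
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens such as [CLS] and [SEP].
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

if __name__ == "__main__":
    # Assumed ids from bert-base-uncased: [CLS]=101, [SEP]=102, [MASK]=103.
    ids = [101, 2023, 2003, 1037, 7099, 6251, 102]
    masked, labels = mask_tokens(ids, vocab_size=30522, mask_id=103,
                                 special_ids=(101, 102), rng=random.Random(0))
    print(masked)
    print(labels)
```

In the script, the selection rate corresponds to the masking probability argument, so you normally only set that value rather than writing this logic yourself.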
Thank you for your suggestion. I see this is for pre-training on the MLM task. Could you tell me how to work with the NSP task? I am having issues with it. This is the error I get:
RuntimeError: The size of tensor a (763) must match the size of tensor b (512) at non-singleton dimension 1
I also tried limiting the maximum length by passing an argument, but that did not work.
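The shapes in that error suggest that a 763-token input reached a model whose position embeddings cover only 512 positions, which is BERT's default limit. For NSP the input is a sentence pair, so both segments plus [CLS] and two [SEP] tokens have to fit in that budget together. Below is a minimal sketch of trimming a pair before it is fed to the model. The helper names truncate_pair and build_nsp_input are hypothetical, and the code works on plain token lists rather than any particular library's objects. If you are using a Hugging Face tokenizer, passing truncation=True together with max_length=512 when encoding the pair is the usual way to get the same effect.

```python
def truncate_pair(tokens_a, tokens_b, max_len=512, num_special=3):
    """Trim the longer segment one token at a time until the pair fits.

    Hypothetical helper: num_special reserves room for [CLS] and two [SEP].
    """
    if max_len < num_special:
        raise ValueError("max_len must leave room for the special tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_len - num_special
    while len(a) + len(b) > budget:
        # Drop from the end of whichever segment is currently longer.
        longer = a if len(a) > len(b) else b
        longer.pop()
    return a, b

def build_nsp_input(tokens_a, tokens_b, max_len=512):
    """Build the [CLS] A [SEP] B [SEP] sequence and its segment ids."""
    a, b = truncate_pair(tokens_a, tokens_b, max_len)
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

if __name__ == "__main__":
    # 500 + 260 tokens plus 3 special tokens is 763 before trimming.
    first = ["a"] * 500
    second = ["b"] * 260
    tokens, segment_ids = build_nsp_input(first, second)
    print(len(tokens), len(segment_ids))
```

If the length argument you passed had no effect, it is worth checking that truncation is actually applied to the pair as a whole and not to each sentence separately.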