Training BERT from scratch (MLM+NSP) on a new domain

rish · November 15, 2020, 11:01pm

Hi, I have been trying to train BERT from scratch using the wonderful Hugging Face library. I am following the language modeling tutorial and have adapted it for BERT. Since I am working in a completely new domain, I trained my own tokenizer, which works fine. However, I run into the following error when training the model.

/usr/local/lib/python3.6/dist-packages/transformers/trainer.py in train(self, model_path, trial)
710 self.state.is_world_process_zero = self.is_world_process_zero()
711
--> 712 tr_loss = torch.tensor(0.0).to(self.args.device)
713 self._logging_loss_scalar = 0
714 self._total_flos = self.state.total_flos

RuntimeError: CUDA error: device-side assert triggered

From what I understand, this is related to some tensor mismatch; however, I am unable to resolve it and can’t understand where I am going wrong in building the model. I would really appreciate your help with this.

I have attached the Colab notebook for reference: https://colab.research.google.com/drive/12NHfXeUBo7RBl3Kffa-715i-Zpd0MOzP?usp=sharing

I am using BertForPreTraining, TextDatasetForNextSentencePrediction, DataCollatorForNextSentencePrediction and BertTokenizerFast.

Environment:
tokenizers: 0.9.2
transformers: 3.4.0
torch: 1.7.0
CUDA: 10.1

sgugger · November 16, 2020, 2:02pm

I haven’t run your notebook, but the first thing I’d double check looking at it is that your vocabulary has indeed 50,000 tokens and not more.
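A "device-side assert triggered" error during training is often caused by an index that falls outside a lookup table, for example a token id that is greater than or equal to the model's configured vocab_size. The check suggested above can be sketched without any library as follows. The helper names count_vocab_entries and out_of_range_ids are hypothetical, and the sketch assumes a WordPiece-style vocab file with one token per line.

```python
def count_vocab_entries(vocab_path):
    """Count the tokens in a vocab file that stores one token per line."""
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.rstrip("\n"))


def out_of_range_ids(input_ids, vocab_size):
    """Return the ids that an embedding table with vocab_size rows cannot index."""
    return sorted({i for row in input_ids for i in row if i < 0 or i >= vocab_size})


# Illustrative batch of token ids (made-up values, not taken from the notebook).
batch = [[2, 17, 49_999, 3], [2, 50_000, 3]]
bad = out_of_range_ids(batch, vocab_size=50_000)
if bad:
    print("ids outside the embedding table:", bad)
```

If the file holds more entries than the configured vocab_size, or any encoded id shows up in the out-of-range list, the embedding lookup will fail. Running a single batch on CPU instead of the GPU usually turns the opaque CUDA assert into a more readable index error.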

rish · November 16, 2020, 8:29pm

Hi @sgugger, yes the vocabulary has 50000 tokens.

rgwatwormhill · November 17, 2020, 2:49pm

Hi, I’m not an expert, but wouldn’t you need to do the MLM training before the NSP training?
(Does that happen automatically as part of the tokenizer training?)

Have you tried reducing your batch size? It depends on your text lengths, but 64 might be too big. When fine-tuning BERT on a Colab GPU, I found that maxlen x batchsize = 8192 should fit (e.g. 512 x 16 or 128 x 64); anything bigger causes OutOfMemory.
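The rule of thumb above amounts to a fixed token budget per batch. A minimal sketch of the arithmetic follows; the function name max_batch_size is hypothetical, and the 8192 figure is only the poster's observation for a Colab GPU, not a general limit.

```python
def max_batch_size(max_len, token_budget=8192):
    """Largest batch size such that max_len * batch_size stays within token_budget."""
    return max(1, token_budget // max_len)


for max_len in (128, 256, 512):
    print(max_len, max_batch_size(max_len))
```

The actual limit depends on the GPU, the model size and the optimizer, so this is a starting point to adjust from rather than a guarantee.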

By the way, is the save_steps correct?

rish · November 17, 2020, 9:24pm

Hi @rgwatwormhill, according to the Hugging Face documentation, BertForPreTraining provides two training heads, MLM and NSP (https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForPreTraining).

Also, according to this PR on GitHub (https://github.com/huggingface/transformers/pull/6644), DatasetforNSP and DatacollatorforNSP both take care of the MLM + NSP data formatting. I expected that using both (DatasetforNSP and DatacollatorforNSP) with BertForPreTraining should be fine, but I may be wrong. As for the batch size, you are correct that it may be too large for training; however, the error I am getting seems to have a different cause, and the run never gets as far as training.
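For readers unfamiliar with what "MLM + NSP data formatting" involves, here is a minimal, library-free sketch of the idea. It is not the code of the dataset or collator classes mentioned above. The helper names make_nsp_pairs and mask_tokens are hypothetical, and the masking is simplified to always substitute [MASK], whereas the original BERT recipe sometimes keeps or randomizes the selected token. The label convention (0 = real next sentence, 1 = random sentence) follows the BertForPreTraining documentation.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) triples from lists of sentences.

    label 0: sentence_b really follows sentence_a; label 1: sentence_b is random.
    """
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if len(documents) > 1 and rng.random() < 0.5:
                # Draw the second sentence from a different document.
                other = rng.choice([j for j in range(len(documents)) if j != doc_idx])
                pairs.append((doc[i], rng.choice(documents[other]), 1))
            else:
                pairs.append((doc[i], doc[i + 1], 0))
    return pairs


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Replace roughly mask_prob of the tokens; labels hold the originals, else None."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


rng = random.Random(0)
docs = [["the cat sat", "it purred", "then it slept"], ["stocks fell", "bonds rose"]]
for sent_a, sent_b, label in make_nsp_pairs(docs, rng):
    print(mask_tokens((sent_a + " [SEP] " + sent_b).split(), rng), label)
```

A model with both heads is then trained on the masked-token labels and the sentence-pair label at the same time, which is why a separate MLM stage is not needed before NSP.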

rgwatwormhill · November 17, 2020, 10:21pm

Why does your notebook config say

vocab_size=50_000,

rish · November 17, 2020, 10:46pm

I want the BERT model I am training to have a vocabulary size of 50000. This vocab is generated using the bwpt tokenizer, and it is available at the path Bert/voc-vocab.txt.

rgwatwormhill · November 17, 2020, 10:49pm

Sorry, I didn’t make myself clear:

it currently says 50_000
I think it should say 50000

rish · November 17, 2020, 11:04pm

I think both should be fine; Python allows underscores in numeric literals (PEP 515).
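A quick check of the point above: since Python 3.6, underscores in numeric literals are purely visual separators, so the two spellings denote the same integer. The variable names here are only for illustration.

```python
vocab_with_underscore = 50_000
vocab_plain = 50000

# Both literals are the same int, so a config built from either is identical.
print(vocab_with_underscore == vocab_plain)
```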

rgwatwormhill · November 17, 2020, 11:55pm

Oh yes, thank you (I didn’t know that!)

kumarme072 · February 2, 2024, 5:27pm

Is there any limit on the vocab size?
I don’t think so, but if there is a limit, please explain why we can’t increase the vocab size beyond 50k.
What will happen if we do so?
