Training BERT from scratch (MLM+NSP) on a new domain

Hi, I have been trying to train BERT from scratch using the wonderful Hugging Face library. I am following the language modeling tutorial and have adapted it for BERT. Since I am working on a completely new domain, I trained my own tokenizer, which trains fine. However, I run into the following error when training the model.


/usr/local/lib/python3.6/dist-packages/transformers/ in train(self, model_path, trial)
710 self.state.is_world_process_zero = self.is_world_process_zero()
--> 712 tr_loss = torch.tensor(0.0).to(self.args.device)
713 self._logging_loss_scalar = 0
714 self._total_flos = self.state.total_flos

RuntimeError: CUDA error: device-side assert triggered

From what I understand, this is related to some kind of tensor mismatch. However, I am unable to resolve it and can’t work out where I am going wrong in building the model. I would really appreciate your help with this.
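A device-side assert like this is often raised when an index passed to an embedding lookup falls outside the table, so one thing worth checking before training is that every id in a batch is within range. The following is only a sketch in plain Python: `find_out_of_range` and the `limits` mapping are hypothetical names, and the limits shown (50000 for `input_ids`, 2 for `token_type_ids`) are assumptions that would need to match the actual model config.

```python
from typing import Dict, List, Tuple


def find_out_of_range(
    batch: Dict[str, List[List[int]]],
    limits: Dict[str, int],
) -> List[Tuple[str, int, int, int]]:
    """Return (field, row, column, value) for every id outside [0, limit)."""
    problems = []
    for field, limit in limits.items():
        for row_idx, row in enumerate(batch.get(field, [])):
            for col_idx, value in enumerate(row):
                if value < 0 or value >= limit:
                    problems.append((field, row_idx, col_idx, value))
    return problems


# Hypothetical batch: the second sequence contains id 50000, which is one
# past the last valid index of a 50000-row embedding table.
example_batch = {
    "input_ids": [[2, 17, 49999, 3], [2, 50000, 8, 3]],
    "token_type_ids": [[0, 0, 1, 1], [0, 0, 1, 1]],
}
# Assumed limits; they should mirror the values in the model config.
example_limits = {"input_ids": 50000, "token_type_ids": 2}

bad = find_out_of_range(example_batch, example_limits)
print(bad)
```

Running a single batch on the CPU instead of the GPU usually produces a more specific error message as well.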

I have attached the colab notebook for reference


I am using BertForPreTraining, TextDatasetForNextSentencePrediction, DataCollatorForNextSentencePrediction and BertTokenizerFast.

tokenizers: 0.9.2
transformers: 3.4.0
torch: 1.7.0
CUDA: 10.1

I haven’t run your notebook, but the first thing I’d double check looking at it is that your vocabulary has indeed 50,000 tokens and not more.
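One way to run that check, sketched under the assumption that the vocab file holds one token per line (so the line index is the token id): count the entries and compare the count with the vocab size given in the model config. The helper names `count_vocab_entries` and `vocab_fits` are hypothetical, and the tiny temporary file only stands in for the real vocab file.

```python
import os
import tempfile


def count_vocab_entries(vocab_path: str) -> int:
    """Count non-empty lines in a one-token-per-line vocab file."""
    with open(vocab_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def vocab_fits(vocab_path: str, config_vocab_size: int) -> bool:
    """True when ids 0 .. n-1 all fit in a table of config_vocab_size rows."""
    return count_vocab_entries(vocab_path) <= config_vocab_size


# Demo with a small temporary file standing in for the real vocab file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nhello\n")
    demo_count = count_vocab_entries(demo_path)
    fits_small = vocab_fits(demo_path, 5)  # table too small for 6 tokens
    fits_exact = vocab_fits(demo_path, 6)  # table exactly large enough

print(demo_count, fits_small, fits_exact)
```

If the count comes out larger than the configured vocab size, the tokenizer can emit ids that the embedding layer cannot look up.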

Hi @sgugger, yes the vocabulary has 50000 tokens.

Hi, I’m not an expert, but wouldn’t you need to do the MLM training before the NSP training?
(Does that happen automatically as part of the tokenizer training?)

Have you tried reducing your batch size? It depends on your text lengths, but 64 might be too big. When fine-tuning BERT on a Colab GPU, I found that maxlen x batchsize = 8192 should fit, e.g. 512 x 16 or 128 x 64; anything bigger causes OutOfMemory.
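The rule of thumb above can be written as a small helper. This is only a sketch: `max_batch_size` is a hypothetical name, and the 8192 budget is the empirical figure quoted in this post for a Colab GPU, not a general limit.

```python
def max_batch_size(max_len: int, token_budget: int = 8192) -> int:
    """Largest batch size such that max_len * batch_size <= token_budget."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Never return less than 1, even if one sequence exceeds the budget.
    return max(1, token_budget // max_len)


batch_for_512 = max_batch_size(512)
batch_for_128 = max_batch_size(128)
print(batch_for_512, batch_for_128)
```

The actual limit depends on the GPU, the model size, and whether this is fine-tuning or pretraining, so the budget would need to be adjusted by trial.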

By the way, is the save_steps correct?

Hi @rgwatwormhill, if I read the Hugging Face documentation correctly,
BertForPreTraining provides two training heads: MLM + NSP.

Also, according to this PR on GitHub, DatasetforNSP and DatacollatorforNSP both take care of the MLM + NSP data formatting. I expected that using both (DatasetforNSP and DatacollatorforNSP) with BertForPreTraining would be fine, but I may be wrong. As for the batch size, you are correct that it may be too large for training; however, the error I am getting seems to have a different cause, since the run fails before training actually starts. :grinning:

Why does your notebook config say


I want the BERT model I am training to have a vocab size of 50000. The vocab was generated using the bwpt tokenizer and is available at the path Bert/voc-vocab.txt.

Sorry, I didn’t make myself clear:

it currently says 50_000
I think it should say 50000

I think both should be fine: Python accepts underscores in numeric literals (PEP 515).
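The two spellings can be compared directly in an interpreter. This is a minimal sketch; it assumes Python 3.6 or later, where underscores are accepted as digit separators, and the variable names are only illustrative.

```python
# Underscores act purely as visual separators in integer literals.
vocab_with_underscore = 50_000
vocab_plain = 50000

same_value = vocab_with_underscore == vocab_plain
print(same_value, type(vocab_with_underscore).__name__)
```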

Oh yes, thank you (I didn’t know that!)