Hi, I have been trying to train BERT from scratch using the wonderful Hugging Face library. I am following the Language modeling tutorial and have adapted it for BERT. Since I am working on a completely new domain, I have trained my own tokenizer, which trains fine. However, I run into the following error during the training of the model:
RuntimeError: CUDA error: device-side assert triggered
From what I understand, this is related to some tensor mismatch; however, I am unable to resolve it and can't figure out where I am going wrong while building the model. I would really appreciate your help with this.
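One thing worth checking: this assert is very often an out-of-range index into the embedding layer, i.e. the tokenizer emitting token ids that are greater than or equal to the model config's vocab_size. Below is a minimal sanity-check sketch; the vocab path is a placeholder and the values are only illustrative.

```python
import os

# CUDA reports device-side asserts asynchronously; forcing synchronous launches
# (or running one batch on CPU) gives a stack trace that points at the real op,
# which for this error is usually an embedding lookup.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import BertConfig, BertTokenizerFast

tokenizer = BertTokenizerFast(vocab_file="path/to/vocab.txt")  # placeholder path
config = BertConfig(vocab_size=50_000)                         # must match the tokenizer

# Sanity check: every id the tokenizer can produce must be < config.vocab_size,
# otherwise the embedding lookup triggers the device-side assert.
encoded = tokenizer("a sample sentence from the new domain", return_tensors="pt")
max_id = int(encoded["input_ids"].max())
assert max_id < config.vocab_size, f"token id {max_id} >= vocab_size {config.vocab_size}"
assert tokenizer.vocab_size <= config.vocab_size
```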
Hi, I’m not an expert, but wouldn’t you need to do the MLM training before the NSP training?
(Does that happen automatically as part of the tokenizer training?)
Have you tried reducing your batch size? It depends on your text lengths, but 64 might be too big. When fine-tuning BERT on a Colab GPU I found that max_len x batch_size ≈ 8192 should fit, e.g. 512 x 16 or 128 x 64; anything bigger causes an OutOfMemory error.
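For concreteness, that rule of thumb maps onto TrainingArguments roughly like this; the output path and values below are only illustrative, not taken from the tutorial.

```python
from transformers import TrainingArguments

# Illustrative settings following the max_len x batch_size ~ 8192 rule of thumb,
# e.g. 512-token sequences with a per-device batch size of 16 (or 128 x 64).
training_args = TrainingArguments(
    output_dir="./bert-from-scratch",    # placeholder output directory
    per_device_train_batch_size=16,      # lower this first if you still hit OOM
    gradient_accumulation_steps=4,       # keeps the effective batch size larger
    num_train_epochs=1,
    save_steps=10_000,
)
```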
Also, according to this PR on GitHub (https://github.com/huggingface/transformers/pull/6644), the DatasetforNSP and DatacollatorforNSP together take care of the MLM + NSP data formatting. I expected that using both of them with BertForPreTraining would be fine, but I may be wrong. You are also right about the batch size, it may be too large for training; however, the error I am getting points to some other problem, and the run never even reaches the training step.
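For reference, here is a minimal sketch of that MLM + NSP pipeline, assuming the shorthands above stand for TextDatasetForNextSentencePrediction and DataCollatorForNextSentencePrediction, and assuming a recent transformers release where the dataset builds the sentence pairs itself and the NSP-specific collator has been deprecated in favour of DataCollatorForLanguageModeling. The corpus path and hyperparameters are placeholders.

```python
from transformers import (
    BertConfig,
    BertForPreTraining,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

# Tokenizer built from the custom-trained vocab file.
tokenizer = BertTokenizerFast(vocab_file="Bert/voc-vocab.txt")

# The config's vocab_size must agree with the tokenizer, otherwise the embedding
# lookup can trigger the device-side assert discussed earlier in the thread.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForPreTraining(config)  # heads for both MLM and NSP

# Builds sentence pairs with a next_sentence_label for the NSP objective.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",  # placeholder training file
    block_size=128,
)

# The language-modeling collator adds the MLM masking on top of the NSP examples.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./bert-nsp-mlm", per_device_train_batch_size=16),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```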
I want the BERT model I am training to have a vocab size of 50,000. This vocab is generated using the bwpt tokenizer, and the vocab file is available at the path Bert/voc-vocab.txt.
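For illustration, this is one way to keep the config in sync with that 50,000-entry vocab file; the hidden size is the BERT-base default of 768, and the parameter figure in the comment is only indicative.

```python
from transformers import BertConfig, BertForPreTraining

# Count the entries in the WordPiece vocab (one token per line) so the config
# always matches the tokenizer exactly.
with open("Bert/voc-vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

config = BertConfig(vocab_size=vocab_size)  # e.g. 50000
model = BertForPreTraining(config)

# A larger vocab mainly grows the (tied) token embedding matrix:
# vocab_size x hidden_size parameters, i.e. roughly 50,000 x 768 ≈ 38M here.
print(model.num_parameters())
```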
Is there any limit on the vocab size?
I don't think so… but if there is a limit, please explain why we can't increase the vocab size beyond 50k.
What will happen if we do?