I pre-trained BERT from scratch on a domain-specific custom dataset using BertForPreTraining with both the MLM and NSP objectives, using a custom tokenizer I trained myself. I trained for 8 epochs and stopped there because the loss had mostly plateaued.
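For context, the from-scratch setup looks roughly like this (a minimal sketch; the tokenizer path and the toy inputs are placeholders, and during real training the `labels` come from the masking collator rather than the raw input ids):

```python
import torch
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

# Custom tokenizer trained on the domain corpus (path is a placeholder)
tokenizer = BertTokenizerFast.from_pretrained("path/to/custom-tokenizer")

# Fresh BERT with randomly initialized weights, sized to the custom vocab
config = BertConfig(vocab_size=len(tokenizer))
model = BertForPreTraining(config)

# One toy step: BertForPreTraining sums the MLM loss (from `labels`)
# and the NSP loss (from `next_sentence_label`) into `outputs.loss`
enc = tokenizer("First sentence.", "Second sentence.", return_tensors="pt")
outputs = model(
    **enc,
    labels=enc["input_ids"],                 # MLM labels (normally the masked positions)
    next_sentence_label=torch.tensor([0]),   # 0 = sentence B follows sentence A
)
print(outputs.loss)  # combined MLM + NSP loss
```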
My final loss was around 2.3. Is that a normal value for BERT pre-training?
A second question: I also continued pre-training another model, this time initialized from bert-base-uncased, again with my custom tokenizer. Is that fine even though the vocab list differs from the original one? The loss is decreasing and behaves almost the same as for the previous model. Is this model really training?
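To make the second question concrete, here is a sketch of the mismatch I mean (the tokenizer path is a placeholder, and the `resize_token_embeddings` call at the end is only my guess at the minimum adjustment needed, not something I am sure is correct):

```python
from transformers import BertForPreTraining, BertTokenizerFast

# Custom tokenizer (placeholder path) plus the pretrained checkpoint
tokenizer = BertTokenizerFast.from_pretrained("path/to/custom-tokenizer")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# The checkpoint's embedding matrix is sized and ordered for the original
# 30522-token WordPiece vocab, while my tokenizer produces ids from a
# different vocab with a different id-to-token mapping
print(model.config.vocab_size)  # 30522 for bert-base-uncased
print(len(tokenizer))           # size of my custom vocab

# If the sizes differ, something like this seems needed just to avoid
# out-of-range indexing, though the pretrained embedding rows still would
# not correspond to my tokenizer's ids:
if len(tokenizer) != model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
```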