Pretraining of BertForMaskedLM - What CELoss should I aim for?

Hi,
I am pretraining the foundation model Geneformer (ctheodoris/Geneformer · Hugging Face) from scratch. The model is based on BertForMaskedLM, and the loss function is cross-entropy loss. After pretraining on approx. 500,000 samples, I get the loss curve below.
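
For context, this is how I understand the loss to be computed: BertForMaskedLM returns the mean cross-entropy over the positions whose labels are not -100, i.e., only the masked tokens. A minimal sketch with a tiny, made-up config (not Geneformer's actual dimensions or vocabulary):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny, made-up config just for illustration; Geneformer uses its own
# gene-token vocabulary and model dimensions.
config = BertConfig(vocab_size=100, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertForMaskedLM(config)

input_ids = torch.randint(0, 100, (1, 16))
labels = input_ids.clone()
labels[:, :12] = -100            # labels set to -100 are ignored by the loss
# (in a real MLM setup, the input tokens at the remaining positions would be
# replaced by the mask token before the forward pass)
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)              # mean cross-entropy over the 4 labelled positions
```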

In the paper “Transfer learning enables predictions in network biology” (Theodoris et al., 2023), Geneformer is pretrained on approx. 30 million samples. I am a beginner in the field of deep learning and have not pretrained a foundation model from scratch before. My question is therefore: What values for the cross-entropy loss are considered bad, and what values are considered good? Does it make sense to continue with hyperparameter optimization, or do I simply have too few samples to pretrain the model properly?
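
To calibrate my expectations, I have been comparing the loss against the cross-entropy of a uniform random guesser, which is ln(V) for a vocabulary of size V, and converting the loss to perplexity via exp(loss). A quick back-of-the-envelope check (the vocabulary size and loss value below are placeholders, not Geneformer's actual numbers):

```python
import math

# Placeholder numbers; substitute the actual size of the Geneformer token
# dictionary and the loss value read off the training curve.
vocab_size = 25_000
observed_loss = 2.0

uniform_ce = math.log(vocab_size)   # loss of a uniform random guesser (in nats)
print(f"uniform-guess cross-entropy: {uniform_ce:.2f}")
print(f"perplexity at observed loss: {math.exp(observed_loss):.1f}")
```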

Currently, I use the following settings (a rough TrainingArguments sketch follows this list):
max learning rate: 1 × 10⁻³
learning rate scheduler: linear with warmup
optimizer: AdamW (Adam with weight decay fix)
warmup steps: 10,000
weight decay: 0.001
batch size: 12
training epochs: 3
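
Roughly, these settings correspond to the following TrainingArguments (just a sketch, not the actual Geneformer pretraining script; the output directory is a placeholder and the dataset/data collator are omitted):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./geneformer_pretraining",   # placeholder path
    learning_rate=1e-3,                       # max learning rate
    lr_scheduler_type="linear",               # linear decay after warmup
    warmup_steps=10_000,
    optim="adamw_torch",                      # AdamW (Adam with weight decay fix)
    weight_decay=0.001,
    per_device_train_batch_size=12,
    num_train_epochs=3,
)
```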

Best,
