Pretraining of BertForMaskedLM - What CELoss should I aim for?

Hi,
I am pretraining the foundation model Geneformer (ctheodoris/Geneformer · Hugging Face) from scratch. The model is based on BertForMaskedLM, and the loss function is cross-entropy loss. After pretraining on approx. 500,000 samples, I get the loss curve below.
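
For context, this is how I understand the loss to be computed: BertForMaskedLM returns the mean cross-entropy over the positions whose labels are not -100, i.e., only the masked tokens. A minimal sketch with a tiny, made-up config (not Geneformer's actual dimensions or vocabulary):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny, made-up config just for illustration; Geneformer uses its own
# gene-token vocabulary and model dimensions.
config = BertConfig(vocab_size=100, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertForMaskedLM(config)

input_ids = torch.randint(0, 100, (1, 16))
labels = input_ids.clone()
labels[:, :12] = -100            # labels set to -100 are ignored by the loss
# (in a real MLM setup, the input tokens at the remaining positions would be
# replaced by the mask token before the forward pass)
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)              # mean cross-entropy over the 4 labelled positions
```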

In the paper “Transfer learning enables predictions in network biology” (Theodoris et al., 2023), Geneformer is pretrained on approx. 30 million samples. I am a beginner in the field of deep learning and have not pretrained a foundation model from scratch before. My question is therefore: What values for the cross-entropy loss are considered bad, and what values are considered good? Does it make sense to continue with hyperparameter optimization, or do I simply have too few samples to pretrain the model properly?
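
To calibrate my expectations, I have been comparing the loss against the cross-entropy of a uniform random guesser, which is ln(V) for a vocabulary of size V, and converting the loss to perplexity via exp(loss). A quick back-of-the-envelope check (the vocabulary size and loss value below are placeholders, not Geneformer's actual numbers):

```python
import math

# Placeholder numbers; substitute the actual size of the Geneformer token
# dictionary and the loss value read off the training curve.
vocab_size = 25_000
observed_loss = 2.0

uniform_ce = math.log(vocab_size)   # loss of a uniform random guesser (in nats)
print(f"uniform-guess cross-entropy: {uniform_ce:.2f}")
print(f"perplexity at observed loss: {math.exp(observed_loss):.1f}")
```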

Currently, I use the following settings (a rough TrainingArguments sketch follows this list):
max learning rate: 1 × 10⁻³
learning rate scheduler: linear with warmup
optimizer: AdamW (Adam with weight decay fix)
warmup steps: 10,000
weight decay: 0.001
batch size: 12
training epochs: 3
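
Roughly, these settings correspond to the following TrainingArguments (just a sketch, not the actual Geneformer pretraining script; the output directory is a placeholder and the dataset/data collator are omitted):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./geneformer_pretraining",   # placeholder path
    learning_rate=1e-3,                       # max learning rate
    lr_scheduler_type="linear",               # linear decay after warmup
    warmup_steps=10_000,
    optim="adamw_torch",                      # AdamW (Adam with weight decay fix)
    weight_decay=0.001,
    per_device_train_batch_size=12,
    num_train_epochs=3,
)
```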

Best,
