Are albert-base-v1 (and v2) pretrained enough?

Hi all,

I have a question about the albert-base-v1 and v2 models uploaded to the Hugging Face model hub. To validate the initial model performance (without additional training), I checked the MLM loss of the ALBERT base models on the BookCorpus dataset and SQuAD context data (which is essentially Wikipedia data), based on this example script from the transformers repo. The average MLM losses come out around 2.5 (SQuAD contexts) and 3.2 (BookCorpus) for albert-base-v1, which is much worse than I expected, given that both datasets must have been used to pretrain the ALBERT base model. The values were even worse for v2.
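For reference, a minimal sketch of the kind of check described above: mask a token and read off the MLM loss from `AlbertForMaskedLM`. The sample sentences and the masked position are toy placeholders, not the actual SQuAD/BookCorpus data or the example script's masking procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
model = AutoModelForMaskedLM.from_pretrained("albert-base-v1")
model.eval()

def masked_loss(text: str) -> float:
    """Mask one token and return the MLM loss at that position."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100 = ignored by the loss
    pos = 2  # arbitrary in-sentence token; a toy choice for illustration
    labels[0, pos] = input_ids[0, pos]
    input_ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        out = model(input_ids=input_ids,
                    attention_mask=enc["attention_mask"],
                    labels=labels)
    return out.loss.item()

losses = [masked_loss(t) for t in [
    "The quick brown fox jumps over the lazy dog.",
    "Paris is the capital of France.",
]]
print(sum(losses) / len(losses))
```

The real evaluation above masks 15% of tokens at random over a full corpus; this sketch only shows where the loss comes from.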

I would like to ask whether the albert-base models were trained until the loss converged, or only for one or two epochs. Also, it would be great if there were an actual training script and/or an MLM loss history for the ALBERT pretraining runs.


The models on the hub are not trained by HuggingFace (unless explicitly mentioned), so the weights are the original implementation's weights, ported/converted to the HF implementation. Whether or not the model is trained "well enough" is a question for the original authors of the model.


Thanks for the reply @BramVanroy! And thanks a lot for correcting me. I was wondering why the sentence order prediction (SOP) classification weights (analogous to BERT's next sentence prediction weights) are not ported/converted, while they are provided by the original implementation? Thanks!

As far as I know, the BERT NSP weights are present when you use BertForPreTraining or BertForNextSentencePrediction. But you are right that AlbertForPreTraining does not seem to load the SOP weights, and there is no SOP-only model class. I am not sure why that is the case.
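One way to inspect this yourself, assuming a recent transformers version where AlbertForPreTraining carries an `sop_classifier` head: load the checkpoint and list the SOP head's parameters. Whether those parameters were actually loaded from the hub checkpoint or freshly initialized shows up in the warning `from_pretrained` logs about newly initialized weights.

```python
from transformers import AlbertForPreTraining

# Loading prints a warning if any weights (e.g. the SOP head) were not
# found in the checkpoint and had to be newly initialized.
model = AlbertForPreTraining.from_pretrained("albert-base-v1")

# List the parameters belonging to the SOP classification head.
sop_params = [name for name, _ in model.named_parameters()
              if "sop_classifier" in name]
print(sop_params)
```

If the SOP weights were not ported, they will appear in the "newly initialized" warning, meaning the head exists in the class but its values are random rather than the original implementation's.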

Thanks, @BramVanroy.