ALBERT pre-training convergence problem

My ALBERT model, pre-trained from scratch, can't get the training loss to converge toward 0, even on WikiText-2.

  • The training loss converged at 6.6 when using AlbertForMaskedLM as the model class
  • The training loss went negative when using AlbertForPretrain as the model class

Note: in the last run I deliberately set the eval dataset to be the same as the training set, in order to check the training loss.
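
For reference, here is a minimal sketch of the kind of setup I'm describing, assuming the Hugging Face Trainer API, the datasets library, and a pretrained ALBERT tokenizer; the hyperparameters, sequence length, and dataset split are placeholders, not my exact configuration:

```python
from datasets import load_dataset
from transformers import (
    AlbertConfig,
    AlbertForMaskedLM,
    AlbertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the pretrained ALBERT tokenizer; only the model weights are from scratch.
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")

# WikiText-2 tokenized into truncated sequences (block grouping omitted for brevity).
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized ALBERT trained with the MLM objective only.
config = AlbertConfig(vocab_size=tokenizer.vocab_size)
model = AlbertForMaskedLM(config)

# Standard 15% masking for masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="albert-scratch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=train_ds,  # eval set deliberately identical to the training set
    data_collator=collator,
)
trainer.train()
trainer.evaluate()
```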
I also raised an issue here: