I trained a few other BERT models, and all of them seem to need a few steps (up to 50) before the train loss drops below the validation loss, even with different random states etc. Do you think this is something I don't really have to worry about? After those “starting problems”, the losses look normal and healthy to me (0.3 vs. 0.6 by the time early stopping ended training).
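
For reference, here is a minimal sketch of how the crossover point could be checked, assuming the train and validation losses are logged at the same steps into two lists. The helper name `first_crossover` and the loss values are made up for illustration; they are not my actual logs.

```python
def first_crossover(train_losses, val_losses):
    """Return the first index where train loss is below validation loss, else None."""
    for step, (train_loss, val_loss) in enumerate(zip(train_losses, val_losses)):
        if train_loss < val_loss:
            return step
    return None


# Illustrative values only (hypothetical, not real training logs).
train = [0.95, 0.80, 0.62, 0.48, 0.35]
val = [0.90, 0.75, 0.65, 0.62, 0.60]

crossover = first_crossover(train, val)
print(f"train loss first drops below validation loss at logged index {crossover}")
```

If the returned index stays small and similar across runs with different random states, that would match the pattern described above.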