HuggingFace Trainer - Eval loss abruptly goes up at the last step of training

Hello. I’m using the HuggingFace Trainer together with TensorBoard to pretrain transformers and visualize the loss curves (TensorBoard reads the event files from the runs subfolder that the Trainer creates).
For example, I used it to train a small instance of BertForMaskedLM with two layers and two attention heads per layer (i.e., BERT-tiny), with a very large gradient accumulation and an evaluation every 100 steps.
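For context, the model itself is created roughly like this (a sketch: hidden_size and intermediate_size here are the standard BERT-tiny values and stand in for my exact config):

from transformers import BertConfig, BertForMaskedLM

# Rough sketch of the model setup: 2 layers, 2 attention heads per layer.
# hidden_size / intermediate_size follow the usual BERT-tiny values and are
# placeholders for the exact numbers I used.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,  # tokenizer is my pretrained tokenizer
    num_hidden_layers=2,
    num_attention_heads=2,
    hidden_size=128,
    intermediate_size=512,
)
model = BertForMaskedLM(config)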
But the plots of the train and eval loss look weird:

First, the eval loss is smaller than the train loss. I suppose this is because the model is automatically switched to .eval() mode for evaluation, so dropout is turned off, and that is why the loss is smaller (please correct me if I’m wrong; there is a small sanity check after these two points).
Second, the eval loss abruptly goes up at the very end of training, while the train loss looks completely fine. This happens not only for this particular model but for bigger models as well, and I have no idea what is going on here.
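To illustrate the first point, here is a tiny self-contained sanity check (not my training code, just a sketch) showing that the same batch gives a different loss once dropout is switched off by .eval(); on a trained model it is usually lower:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model with dropout, only to show the effect of train() vs eval() mode.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))
loss_fn = nn.CrossEntropyLoss()

net.train()  # dropout active, as during training steps
print("train mode loss:", loss_fn(net(x), y).item())

net.eval()   # dropout disabled, as during Trainer evaluation
with torch.no_grad():
    print("eval mode loss: ", loss_fn(net(x), y).item())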

Here is the code for setting up and launching the Trainer:

from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Mask 15% of the tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=4,
    per_device_train_batch_size=10,
    warmup_ratio=0.06,
    learning_rate=0.001,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=0.001,
    gradient_accumulation_steps=400,  # effective batch size: 10 * 400 = 4000 per device
    save_steps=1000,
    save_total_limit=10,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset_eval
)

trainer.train()

After training, I run a small evaluation by hand using

trainer.evaluate(tokenized_dataset[:10000]['input_ids'])

and

trainer.evaluate(tokenized_dataset_eval[:10000]['input_ids'])
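Assuming tokenized_dataset and tokenized_dataset_eval are datasets.Dataset objects, I believe an equivalent and somewhat cleaner way is to pass proper Dataset subsets:

# Assuming the datasets are datasets.Dataset objects, evaluate on the first
# 10000 examples by selecting a real Dataset subset instead of slicing.
train_subset = tokenized_dataset.select(range(10000))
eval_subset = tokenized_dataset_eval.select(range(10000))
print(trainer.evaluate(train_subset))
print(trainer.evaluate(eval_subset))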

Usually the evaluation results are similar for the train and eval subsets.
For example, for BERT-tiny I get 3.852 and 3.843, respectively. Both losses are close to the loss on the eval plot after this weird jump.
At the same time, the minimal eval loss before the jump is 3.57, which is significantly lower. I would be happy to keep the model with a loss of 3.57 instead of 3.84-3.85.
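One workaround I am considering (a sketch based on the TrainingArguments docs, not something I have verified against this issue) is to let the Trainer restore the checkpoint with the lowest eval loss at the end of training:

from transformers import TrainingArguments

# Sketch: keep and restore the checkpoint with the lowest eval loss.
# With "steps" strategies, save_steps has to line up with eval_steps.
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # ... plus the other arguments from the setup above ...
)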

I would be very grateful if someone could explain why the Trainer makes the eval loss go up at the end and how to avoid it. Thank you!

I checked the model from the last checkpoint saved before the eval loss went up, and it turns out to have a loss similar to the loss after the jump (~3.8). Could it be that the Trainer was never showing the real eval loss, but some smaller number instead?
Or is something going wrong at the moment the model weights are saved?
Or maybe there is a bug connected to the very large gradient accumulation (400 accumulation steps)?
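For reference, this is roughly how I checked that checkpoint (a sketch; the checkpoint-9000 directory name is just an example, the Trainer saves checkpoints as {output_dir}/checkpoint-<step>):

from transformers import BertForMaskedLM, Trainer

# Load the weights saved at an intermediate step (directory name is an example).
checkpoint_model = BertForMaskedLM.from_pretrained(f"{output_dir}/checkpoint-9000")

checkpoint_trainer = Trainer(
    model=checkpoint_model,
    args=training_args,
    data_collator=data_collator,
    eval_dataset=tokenized_dataset_eval,
)
print(checkpoint_trainer.evaluate())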