Hello. I'm using HuggingFace Trainer together with TensorBoard to pretrain transformers and visualize the loss plots (TensorBoard reads the information from the **runs** subfolder created by the Trainer).

For example, I used it to train a small instance of BertForMaskedLM with two layers and two attention heads per layer (the configuration known as BERT tiny), with a large gradient accumulation and a short evaluation every 100 steps.
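For reference, the model was created roughly like this. The hidden and intermediate sizes (128/512) are the ones used by the published BERT-tiny checkpoint, so treat them as an assumption rather than my exact values:

```python
from transformers import BertConfig, BertForMaskedLM

# Hypothetical "BERT tiny" configuration: 2 layers, 2 heads per layer.
# hidden_size=128 / intermediate_size=512 follow the published bert-tiny.
config = BertConfig(
    num_hidden_layers=2,
    num_attention_heads=2,
    hidden_size=128,
    intermediate_size=512,
)
model = BertForMaskedLM(config)

# Most of the parameters live in the 30522-token embedding matrix.
print(sum(p.numel() for p in model.parameters()))
```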

But the plots of the train and eval loss look weird:

First, the eval loss is smaller than the train loss. I suppose that during evaluation the model is automatically switched to .eval() mode, so dropout is turned off, and that's why the loss is smaller (please correct me if I'm wrong).
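As a sanity check on the dropout hypothesis, here is a minimal PyTorch sketch (independent of the Trainer) showing that dropout perturbs activations in train mode and is a no-op in eval mode:

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()            # training mode: elements randomly zeroed, survivors scaled by 1/(1-p)
train_out = drop(x)

drop.eval()             # eval mode: dropout does nothing
eval_out = drop(x)

print(train_out)        # mixture of 0.0 and 2.0
print(eval_out)         # identical to x
```

So a lower loss under .eval() than under .train() is not surprising on its own, especially for a small model with relatively strong dropout.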

Second, the eval curve abruptly goes up at the very end of training, while the train curve looks completely fine. This happens not only for this particular model but for bigger models as well, and I have no idea what is going on there.

Here is the code for setting up and launching the Trainer:

```python
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=4,
    per_device_train_batch_size=10,  # per_gpu_train_batch_size is deprecated
    warmup_ratio=0.06,
    learning_rate=0.001,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=0.001,
    gradient_accumulation_steps=400,
    save_steps=1000,
    save_total_limit=10,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    prediction_loss_only=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset_eval,
)
trainer.train()
```

Afterwards, I run a small evaluation by hand using

```python
trainer.evaluate(tokenized_dataset[:10000]['input_ids'])
```

and

```python
trainer.evaluate(tokenized_dataset_eval[:10000]['input_ids'])
```

Usually the evaluation results are similar for the train and eval subsets.

For example, for BERT tiny I get 3.852 and 3.843 respectively. Both losses are close to the loss on the eval plot after this strange upward jump.

At the same time, the minimal eval loss before this jump is 3.57, which is significantly smaller. I would much rather keep the model with loss 3.57 than the one with 3.84-3.85.
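In case it matters: one workaround I'm considering is letting the Trainer restore the best checkpoint automatically via `load_best_model_at_end`. A sketch of the relevant arguments, assuming the same settings as above (note that this option requires the save and eval schedules to align, so `save_steps` would have to become a multiple of `eval_steps`):

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    # ... same settings as above ...
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,                   # must be a multiple of eval_steps here
    load_best_model_at_end=True,      # reload the best checkpoint when training finishes
    metric_for_best_model="eval_loss",
    greater_is_better=False,          # lower eval_loss is better
)
```

But I would still like to understand why the eval loss rises in the first place.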

I would be very grateful if someone could explain why the Trainer's eval loss goes up at the end of training and how to avoid that. Thank you!