Using hyperparameter-search in Trainer

The strange results actually come from the network being unable to learn anything at these learning rates, which are very high in your runs, as you can see.

Transformers need a much lower fine-tuning learning rate (e.g. 5e-5).
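
One way to keep the search away from those high values is to restrict the search space yourself. Here is a minimal sketch, assuming the Optuna backend; the bounds 1e-5 to 1e-4 are my own illustrative choice around the 5e-5 mentioned above, not values taken from your setup, so adjust them as needed.

```python
# Hypothetical search space that keeps the learning rate in a range
# typically used for fine-tuning (the bounds are an assumption).
def hp_space(trial):
    # `trial` is an Optuna trial object; log=True samples on a log scale,
    # which suits learning rates spanning an order of magnitude.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
    }

# Usage, assuming `trainer` is a Trainer created with a `model_init` function:
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space, backend="optuna", n_trials=10, direction="minimize"
# )
```

Only the learning rate is searched here; other hyperparameters can be added to the returned dict in the same way.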