Is there a mechanism or feature in RoBERTa that limits the logits difference of its predictions?

We have been using the RoBERTa model to classify descriptions of safety-related events that occurred at nuclear power plants according to the categories of the relevant international taxonomy, and we obtained an interesting insight we would like to share, hoping to get some feedback on it:

  1. We prepared a training dataset of pairs of the form: << description of safety-related event [SEP] definition of category // Label (0 if related, 1 if unrelated) >>
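As a minimal sketch of how one such training pair could be assembled (the event text, category definition, and helper name below are hypothetical illustrations, not our actual data):

```python
def make_example(event: str, definition: str, related: bool) -> dict:
    """Pair an event description with a category definition.

    Label convention from step 1: 0 = related, 1 = unrelated.
    """
    return {
        "text": f"{event} [SEP] {definition}",
        "label": 0 if related else 1,
    }

# Hypothetical example pair
pair = make_example(
    "Unexpected reactor trip during turbine valve testing.",
    "Events involving automatic or manual reactor shutdown.",
    related=True,
)
```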

  2. We fine-tuned BERT (with the next-sentence-prediction head), and RoBERTa and GPT-2 (both with a sequence classification head on top), using Adam optimization with the same hyperparameters: β1 = 0.9, β2 = 0.999, eps = 1e-8, and an L2 weight decay of 0, with a batch size of 24 for 3 epochs. The learning rate was warmed up over the first 10% of the total steps to a peak value of 3e-5.
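For concreteness, the warmup schedule in step 2 can be sketched as a pure function of the step index. This assumes a linear warmup followed by a linear decay to zero (the Hugging Face `get_linear_schedule_with_warmup` convention); the post only states the warmup, so the decay shape is an assumption:

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 3e-5, warmup_frac: float = 0.10) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup over the first `warmup_frac` of steps to `peak_lr`,
    then (assumed) linear decay to zero over the remaining steps.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from peak_lr down to 0.
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```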

  3. For each of the 138 categories of the taxonomy, we calculated the average logits difference obtained from the predictions of the fine-tuned models over the training dataset, as well as the Matthews Correlation Coefficient, to compare how accurately the models classified each category (again over the same training dataset).

While plotting the relation between these two parameters for each of the categories, we obtained these graphs:


We fail to understand why the behaviour of RoBERTa (whose average logits difference never exceeds 6 for any category) looks so different from that of BERT and GPT-2. We would therefore appreciate any considerations, comments, or thoughts that might help us understand it.