Is there a mechanism or feature in RoBERTa limiting the logits difference of its predictions?

We have been using the RoBERTa model to classify descriptions of safety-related events that occurred at nuclear power plants according to the categories of the related international taxonomy, and we came across an interesting insight that we would like to share in the hope of getting some feedback:

  1. We prepared a training dataset whose examples take the form << description of safety-related event [SEP] definition of category // label (0 if related, 1 if unrelated) >>.

  2. We fine-tuned BERT (using its next-sentence-prediction head), as well as RoBERTa and GPT-2 (both with a sequence classification head on top), with Adam optimization and the same hyperparameters throughout: β1 = 0.9, β2 = 0.999, eps = 1e-8 and L2 weight decay of 0, a batch size of 24 and 3 epochs. The learning rate warmed up over the first 10% of the total steps to a peak value of 3e-5. A sketch of this pairing and fine-tuning setup follows the list.

  3. For each of the 138 categories of the taxonomy, we calculated the average logits difference obtained from the fine-tuned models' predictions over the training dataset, together with the Matthews Correlation Coefficient (MCC), to compare how accurately the models classified each category (again over the same training dataset); see the metrics sketch below.
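For concreteness, here is a minimal sketch of the pairing and fine-tuning setup from steps 1 and 2, assuming the Hugging Face transformers library (RoBERTa shown; BERT and GPT-2 are analogous). The events, categories and checkpoint names are illustrative placeholders, not our actual data:

```python
# Minimal sketch of the pairing + fine-tuning setup (RoBERTa shown).
# `events` and `categories` are hypothetical placeholders for our data.
import torch
from transformers import (RobertaTokenizerFast, RobertaForSequenceClassification,
                          get_linear_schedule_with_warmup)

# (description, set of true categories) -- placeholder data
events = [("Pump seal failure detected during plant startup.", {"EQUIPMENT_FAILURE"})]
# category -> definition -- placeholder taxonomy
categories = {"EQUIPMENT_FAILURE": "Events caused by the failure of a plant component."}

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Build << description [SEP] definition // label >> pairs: 0 = related, 1 = unrelated.
firsts, seconds, labels = [], [], []
for description, true_cats in events:
    for cat, definition in categories.items():
        firsts.append(description)
        seconds.append(definition)
        labels.append(0 if cat in true_cats else 1)

enc = tokenizer(firsts, seconds, truncation=True, padding=True, return_tensors="pt")
labels = torch.tensor(labels)

# Adam with beta1=0.9, beta2=0.999, eps=1e-8 and L2 weight decay of 0
# (AdamW with zero decay behaves like plain Adam); the learning rate warms up
# over the first 10% of the total steps to a peak of 3e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
epochs, batch_size = 3, 24
total_steps = epochs * ((len(labels) + batch_size - 1) // batch_size)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

model.train()
for _ in range(epochs):
    for i in range(0, len(labels), batch_size):
        batch = {k: v[i:i + batch_size] for k, v in enc.items()}
        loss = model(**batch, labels=labels[i:i + batch_size]).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```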
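The per-category metrics from step 3 can then be computed along these lines (a sketch, taking the logits difference of a prediction to be the absolute gap between its two class logits; matthews_corrcoef is from scikit-learn):

```python
# Sketch of the per-category metrics: average logits difference and MCC.
# Here the logits difference is taken as the absolute gap between the two
# class logits of each prediction.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def per_category_metrics(logits, labels, category_ids):
    """logits: (N, 2) array; labels: (N,) gold 0/1; category_ids: (N,) category of each pair."""
    logits = np.asarray(logits)
    preds = logits.argmax(axis=1)
    diffs = np.abs(logits[:, 0] - logits[:, 1])
    results = {}
    for cat in np.unique(category_ids):
        mask = np.asarray(category_ids) == cat
        results[cat] = {"avg_logits_diff": diffs[mask].mean(),
                        "mcc": matthews_corrcoef(labels[mask], preds[mask])}
    return results
```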

Plotting the relation between these two metrics for each of the categories, we obtained these graphs:

[plot: average logits difference vs. MCC per category, one panel per model (BERT, RoBERTa, GPT-2)]

We fail to understand why the behaviour of RoBERTa (whose average logits difference per category never goes beyond 6) looks so different from that of BERT and GPT-2. We would therefore appreciate any considerations, comments or thoughts you might have that could help us understand it.