Accuracy of Masked LM training

Hello everyone,

As I have mentioned before, I am new to Transformers, so I am probably struggling with some basic stuff. I am training a RoBERTa model from scratch and I am trying to compute some metrics for it. I succeeded in printing them, but I have some questions about them:

  1. Do we have TP (true positive), TN (true negative), FP (false positive), and FN (false negative) predictions in MLM?
  2. Extending the previous question, could we use F1, precision, and recall scores in MLM?
    I tried to compute all of the aforementioned scores, including accuracy, during training, and I am getting the same result for all of them. To do that I am using this function:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # take the highest-scoring token id at every position
    predictions = np.argmax(predictions, axis=-1)
    # sklearn expects 1-D arrays, so collapse (batch, seq_len) to flat vectors
    predictions = predictions.flatten()
    labels = labels.flatten()
    accuracy = accuracy_score(y_true=labels, y_pred=predictions)
    recall = recall_score(y_true=labels, y_pred=predictions, average='micro')
    precision = precision_score(y_true=labels, y_pred=predictions, average='micro')
    f1 = f1_score(y_true=labels, y_pred=predictions, average='micro')
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

So three things seem strange to me here:

  1. I ran my code on a single batch of data to check that it works, and I saw that the logits (predicted values) contain predictions for the unmasked tokens as well. That is why all the scores are stuck at 0.0 at first (I assume because the unmasked positions carry the ignore label -100, which argmax can never output, so they always count as wrong). Is this normal?
  2. I am doing validation after each epoch, using the whole validation set at once. The sklearn functions cannot handle multiple sequences directly, so I used flatten(). It seems to work, ‘algorithmically speaking’, but it got me wondering whether I am doing something wrong with the way I batch my data. Since I am working on proteins, I have a list of dictionaries, each with the standard keys ‘input_ids’, ‘labels’, and ‘attention_mask’. Am I looking for trouble here?
  3. Finally, something that struck me as odd: the metrics function has no way of knowing in advance which tokens were masked, so it cannot restrict the calculation to those tokens. In my case, and in all the examples I have seen so far, the metrics are computed over the whole sequence and not just the masked tokens (a minimal sketch of what I have in mind follows this list).
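For example, if the data collator marks the non-masked positions with the ignore label -100 (which is what DataCollatorForLanguageModeling does, as far as I understand), I imagine the metrics could be restricted to the masked positions like this; a minimal sketch with made-up values, not my actual pipeline:

import numpy as np
from sklearn.metrics import accuracy_score

# Made-up (batch, seq_len) arrays: -100 marks positions that were NOT masked
labels = np.array([[-100,    5, -100, -100,    9],
                   [-100, -100,    3, -100, -100]]).flatten()
predictions = np.array([[1, 5, 2, 7, 8],
                        [4, 0, 3, 6, 2]]).flatten()

mask = labels != -100  # keep only the positions that were actually masked
print(accuracy_score(y_true=labels[mask], y_pred=predictions[mask]))  # 2/3
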
Sorry for the long message :slight_smile: But I really tried to find answers to these questions on my own.

Thank you