How to fine-tune on the CoLA dataset using Transformers and PyTorch?

I am trying to use RobertaForSequenceClassification as my backbone and torch.nn.parallel.DistributedDataParallel for data-parallel training.
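Concretely, the setup I have in mind looks roughly like this (a minimal sketch assuming the script is launched with torchrun, which sets LOCAL_RANK; "roberta-base" is just the checkpoint I happen to use):

  import os

  import torch
  import torch.distributed as dist
  from torch.nn.parallel import DistributedDataParallel as DDP
  from transformers import RobertaForSequenceClassification

  # launched with e.g.: torchrun --nproc_per_node=<num_gpus> train.py
  dist.init_process_group(backend="nccl")
  local_rank = int(os.environ["LOCAL_RANK"])
  torch.cuda.set_device(local_rank)

  # CoLA is binary acceptability classification: 0 = unacceptable, 1 = acceptable
  model = RobertaForSequenceClassification.from_pretrained(
      "roberta-base", num_labels=2
  ).cuda(local_rank)
  model = DDP(model, device_ids=[local_rank])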
My questions are as follows.

  1. Is the Matthews correlation metric used for both training and evaluation, or only for evaluation? Is the loss function for the CoLA dataset nn.CrossEntropyLoss or the Matthews correlation? (My current understanding is in the training-loop sketch at the end of this post.)

  2. What should I feed into the model? Is the code below OK? (A fuller data-preparation sketch is at the end of this post.)

  # keep only the tensors the model's forward() expects
  train_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask'])
  val_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask'])
  3. In the RobertaForSequenceClassification source code
    transformers/modeling_roberta.py at 198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b · huggingface/transformers · GitHub
    is the attention_mask that is passed to each layer_module in the RobertaEncoder loop the same at every layer?
If you could share some PyTorch & Hugging Face code that does not use the Hugging Face Trainer, that would be great! My rough attempts so far are below; corrections are very welcome.
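For question 2, here is the fuller data-preparation sketch I mentioned (assuming the GLUE CoLA split from the datasets library, where the text column is "sentence" and the label column is "label"; batch sizes and max_length are arbitrary choices on my part):

  from datasets import load_dataset
  from torch.utils.data import DataLoader
  from torch.utils.data.distributed import DistributedSampler
  from transformers import RobertaTokenizerFast

  tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
  raw = load_dataset("glue", "cola")

  def tokenize(batch):
      return tokenizer(batch["sentence"], truncation=True,
                       padding="max_length", max_length=128)

  train_dataset = raw["train"].map(tokenize, batched=True)
  val_dataset = raw["validation"].map(tokenize, batched=True)

  # the GLUE split stores labels under "label"; the model expects "labels"
  train_dataset = train_dataset.rename_column("label", "labels")
  val_dataset = val_dataset.rename_column("label", "labels")

  train_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask'])
  val_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask'])

  # DistributedSampler shards the training data across the DDP processes
  train_loader = DataLoader(train_dataset, batch_size=32,
                            sampler=DistributedSampler(train_dataset))
  val_loader = DataLoader(val_dataset, batch_size=64)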
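For question 1 and the closing request, here is my rough attempt at a training loop without the Trainer, continuing from the sketches above (epoch count and learning rate are arbitrary). My current understanding, which I would like confirmed, is that nn.CrossEntropyLoss drives the gradient updates (RobertaForSequenceClassification computes it internally when labels are passed), while Matthews correlation is only an evaluation metric, since it is not differentiable in its usual form:

  from sklearn.metrics import matthews_corrcoef
  from torch.optim import AdamW

  optimizer = AdamW(model.parameters(), lr=2e-5)
  device = torch.device("cuda", local_rank)

  for epoch in range(3):
      model.train()
      train_loader.sampler.set_epoch(epoch)  # reshuffle the shards each epoch
      for batch in train_loader:
          batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)  # .loss is CrossEntropy over the 2 classes
          outputs.loss.backward()
          optimizer.step()
          optimizer.zero_grad()

      # evaluation: Matthews correlation coefficient, the official CoLA metric
      model.eval()
      preds, golds = [], []
      with torch.no_grad():
          for batch in val_loader:
              batch = {k: v.to(device) for k, v in batch.items()}
              logits = model(**batch).logits
              preds.extend(logits.argmax(dim=-1).cpu().tolist())
              golds.extend(batch["labels"].cpu().tolist())
      print(f"epoch {epoch}: validation MCC = {matthews_corrcoef(golds, preds):.4f}")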