How to set up Trainer for a regression?

Hello,

I am aware that I can run a regression model by using float target values and setting num_labels=1 on a classification head, like below:

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", 
                                                           num_labels=1,
                                                           ignore_mismatched_sizes=True)

The problem is that right now I am merely adapting the Trainer setup from a classification task, so during training I see an accuracy metric where RMSE or R-squared would be more appropriate.

See the accuracy score below on the validation data:

import numpy as np
from datasets import load_metric
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

metric = load_metric("accuracy")

batch_size = 32

args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    report_to="none",
    weight_decay=0.01,
    output_dir='/content/drive/MyDrive/kaggle/',
    metric_for_best_model='accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # argmax only makes sense for classification logits, not regression outputs
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

which gives

Epoch | Training Loss | Validation Loss | Accuracy
1     | 0.507300      | 0.499625        | 0.503853
2     | 0.466000      | 0.495724        | 0.503853

Which arguments in Trainer should I use so that I get RMSE or R-squared instead? I assume the loss being minimized is already the mean squared error (maybe I am wrong?).

Thanks!


You can use the RMSE metric, like so:

from sklearn.metrics import mean_squared_error

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    rmse = mean_squared_error(labels, predictions, squared=False)
    return {"rmse": rmse}
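Since the question also mentions R-squared, here is a variant of that function (a sketch, assuming scikit-learn is available; it computes RMSE via np.sqrt to avoid the deprecated squared= keyword in newer scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # With num_labels=1 the logits have shape (batch, 1); flatten both arrays first.
    predictions = np.asarray(predictions).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    r2 = r2_score(labels, predictions)
    return {"rmse": rmse, "r2": r2}
```

Pass this as compute_metrics to the Trainer exactly as in the original snippet; both metrics then show up in the evaluation table each epoch.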

Sources:


ahhh!! thanks @nielsr!! And am I correct to assume that the loss being minimized here (0.507 and 0.466 in the example) is the mean squared error as well?


Yes, you can see that in the source code here.
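Concretely, when num_labels=1 and the labels are floats, the model's regression branch computes MSELoss between the squeezed logits and the labels. In plain Python (with made-up numbers) the reported training loss amounts to:

```python
# Toy illustration of the regression loss: mean squared error between
# the model's squeezed logits and the float labels.
logits = [0.8, 1.9, 3.1]   # made-up predictions, shape (batch,) after squeeze
labels = [1.0, 2.0, 3.0]   # made-up float targets

mse = sum((p - y) ** 2 for p, y in zip(logits, labels)) / len(labels)
print(round(mse, 4))  # → 0.02
```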

Note that you can also set the problem_type of the model to “regression” (which is equivalent to setting num_labels=1).
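Not the exact code from the question, but a sketch of what that looks like with the same checkpoint:

```python
from transformers import AutoModelForSequenceClassification

# problem_type="regression" makes the model use MSELoss; with num_labels=1
# and float labels this is also inferred automatically.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    num_labels=1,
    problem_type="regression",
    ignore_mismatched_sizes=True,
)
```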


Using HF Evaluate:

from evaluate import load

metric = load('mse')

# squared=False returns the root mean squared error (RMSE) instead of MSE
metric.compute(predictions=predictions, references=labels, squared=False)