Evaluating Finetuned BERT Model for Sequence Classification

Python 3.7.6
Transformers 4.4.2
PyTorch 1.8.0

Hi HF Community!

I would like to finetune BERT for sequence classification on some training data I have, and also evaluate the resulting model. I am using the Trainer class to do the training and am a little confused about what the evaluation is doing. Below is my code:

import torch
from torch.utils.data import Dataset
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
import pandas as pd

class MyDataset(Dataset):
    def __init__(self, csv_file: str):
        self.df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", padding_side='right', local_files_only=True)
        self.label_list = self.df['label'].value_counts().keys().to_list()

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int) -> tuple:
        if torch.is_tensor(idx):
            idx = idx.tolist()

        text = self.df.iloc[idx, 1]
        tmp_label = self.df.iloc[idx, 3]
        # binary label: 1 for anything other than 'label_a', else 0
        if tmp_label != 'label_a':
            label = 1
        else:
            label = 0
        return (text, label)

def data_collator(dataset_samples_list):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", padding_side='right', local_files_only=True)
    examples = [example[0] for example in dataset_samples_list]
    encoded_results = tokenizer(examples, padding=True, truncation=True, return_tensors='pt')

    # encoded_results already holds batched tensors, so they can be used directly
    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['labels'] = torch.tensor([example[1] for example in dataset_samples_list])
    return batch

train_data_obj = MyDataset('/path/to/train/data.csv')
eval_data_obj = MyDataset('/path/to/eval/data.csv')
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir='/path/to/output/dir',
    evaluation_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data_obj,
    eval_dataset=eval_data_obj,
)

As I understand, once trainer.train() is called, after each epoch the model will be evaluated on the dataset from eval_data_obj and those results will be displayed. After the training is done and the model is saved using trainer.save_model("/path/to/model/save/dir"), trainer.evaluate() will evaluate the saved model on the eval_data_obj and return a dict containing the evaluation loss. Are there other metrics like accuracy that are included in this dict by default? Thank you in advance for your help!

If you want other metrics, you have to indicate that to the Trainer by passing a compute_metrics function. See for instance our official GLUE example or the corresponding notebook.
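For reference, a minimal `compute_metrics` sketch for this two-label setup might look like the following (the function name and the `"accuracy"` key are just illustrative choices; the Trainer only requires a callable that takes the predictions and returns a dict of metrics):

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, label_ids) pair produced by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}
```

Passing `compute_metrics=compute_metrics` to `Trainer(...)` then adds an `eval_accuracy` entry alongside `eval_loss` in the dict returned by `trainer.evaluate()`.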

@sgugger Thank you for the reply, it worked perfectly!
One quick follow up question. If I have finetuned a model and saved it off just after training, what is the best way to load that model and evaluate it on a test set?

You can call Trainer.evaluate on any dataset you want, so just reload it and pass it to Trainer the same way as during training, then run that method.

I see. So I could then specify the location of the newly finetuned model in Trainer, load the eval dataset, pass the eval dataset to Trainer, then run Trainer.evaluate? Just want to make sure I’m not messing anything up.

That would work yes.

Fantastic. Thank you @sgugger!