Evaluating Finetuned BERT Model for Sequence Classification

Python 3.7.6
Transformers 4.4.2
PyTorch 1.8.0

Hi HF Community!

I would like to fine-tune BERT for sequence classification on some training data I have and then evaluate the resulting model. I am using the Trainer class for training and am a little confused about what the evaluation is doing. Below is my code:

import torch
from torch.utils.data import Dataset
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
import pandas as pd

class MyDataset(Dataset):
    def __init__(self, csv_file: str):
        self.df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", padding_side='right', local_files_only=True)
        self.label_list = self.df['label'].value_counts().keys().to_list()

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int) -> tuple:
        if torch.is_tensor(idx):
            idx = idx.tolist()

        text = self.df.iloc[idx, 1]
        tmp_label = self.df.iloc[idx, 3]
        if tmp_label != 'label_a':
            label = 1
        else:
            label = 0
        return (text, label)



# Load the tokenizer once instead of on every batch.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", padding_side='right', local_files_only=True)

def data_collator(dataset_samples_list):
    # Trainer calls the collator with one argument: a list of (text, label) samples.
    examples = [example[0] for example in dataset_samples_list]
    encoded_results = tokenizer(examples, padding=True, truncation=True, return_tensors='pt',
                                return_attention_mask=True)

    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['labels'] = torch.tensor([example[1] for example in dataset_samples_list])
    return batch


train_data_obj = MyDataset('/path/to/train/data.csv')
eval_data_obj = MyDataset('/path/to/eval/data.csv')
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")


training_args = TrainingArguments(
    output_dir='/path/to/output/dir',
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy='epoch',
    num_train_epochs=2,
    save_steps=10,
    gradient_accumulation_steps=4,
    dataloader_drop_last=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data_obj,
    eval_dataset=eval_data_obj,
    data_collator=data_collator
)

trainer.train()
trainer.save_model("/path/to/model/save/dir")
trainer.evaluate()

As I understand it, once trainer.train() is called, the model is evaluated on the dataset from eval_data_obj after each epoch and those results are displayed. After training is done and the model is saved using trainer.save_model("/path/to/model/save/dir"), trainer.evaluate() evaluates the model on eval_data_obj and returns a dict containing the evaluation loss. Are other metrics, such as accuracy, included in this dict by default? Thank you in advance for your help!

If you want other metrics, you have to indicate that to the Trainer by passing a compute_metrics function. See for instance our official GLUE example or the corresponding notebook.
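
A minimal sketch of such a function, assuming the model returns one row of logits per example and that accuracy is the metric of interest. The name compute_metrics matches the Trainer argument; the "accuracy" key is an illustrative choice rather than anything specified in this thread.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy for a classification model.

    `eval_pred` is expected to expose `predictions` (logits with shape
    [num_examples, num_labels]) and `label_ids` (shape [num_examples]),
    as the Trainer's EvalPrediction does.
    """
    logits = np.asarray(eval_pred.predictions)
    labels = np.asarray(eval_pred.label_ids)
    predicted = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predicted == labels).mean())}
```

Passing it as Trainer(..., compute_metrics=compute_metrics) should make trainer.evaluate() report the value next to the loss, under a prefixed key such as eval_accuracy.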

@sgugger Thank you for the reply, it worked perfectly!
One quick follow-up question: if I have fine-tuned a model and saved it right after training, what is the best way to load that model and evaluate it on a test set?

You can call Trainer.evaluate on any dataset you want, so just reload the model, pass it to Trainer the same way as during training, then run that method on your test set.

I see. So I could then specify the location of the newly finetuned model in Trainer, load the eval dataset, pass the eval dataset to Trainer, then run Trainer.evaluate? Just want to make sure I’m not messing anything up.

That would work yes.
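
The steps agreed on above could be sketched roughly as follows. This is an assumption-laden outline, not a confirmed recipe: the helper name evaluate_saved_model and its parameters are hypothetical, and it assumes the model was saved with trainer.save_model and that the same data collator used for training is available.

```python
def evaluate_saved_model(model_dir, test_dataset, data_collator,
                         output_dir, compute_metrics=None):
    """Reload a fine-tuned classifier from `model_dir` and evaluate it.

    Returns the metrics dict produced by Trainer.evaluate.
    """
    # Imported lazily so the helper can be defined without loading the library.
    from transformers import (BertForSequenceClassification, Trainer,
                              TrainingArguments)

    # Load the weights written by trainer.save_model(model_dir).
    model = BertForSequenceClassification.from_pretrained(model_dir)

    # Only evaluation settings matter here; output_dir is still required.
    args = TrainingArguments(output_dir=output_dir,
                             per_device_eval_batch_size=2)

    trainer = Trainer(model=model, args=args,
                      data_collator=data_collator,
                      compute_metrics=compute_metrics)

    # Passing the dataset here overrides any eval_dataset given to Trainer.
    return trainer.evaluate(eval_dataset=test_dataset)
```

A call might look like evaluate_saved_model("/path/to/model/save/dir", MyDataset('/path/to/test/data.csv'), data_collator, "/path/to/output/dir"), where the test CSV path is a placeholder.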

Fantastic. Thank you @sgugger!

Hey @sgugger and @aclifton314

Sorry to jump into your conversation, but I have a similar problem and I can’t get the code to work. Could you please help?

So basically I have a SQuAD-format file and I want to evaluate it with one of the models in the HF repo. After this, I will use a fine-tuned model to do the same thing.

I was able to fine-tune the model, but I can’t evaluate any model (not even the base one) on this file.

Here is my code:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
import torch
from transformers import default_data_collator
import json

# Model from HuggingFace
model_checkpoint = 'mrm8488/bert-italian-finedtuned-squadv1-it-alfa'

# Import tokenizer
my_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Import model
my_model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Dataset for evaluation
eval_data_path = '/content/drive/MyDrive/BERT/SQuAD_files/result.json'

with open(eval_data_path) as json_file:
  data = json.load(json_file)

data_collator = default_data_collator
trainer = Trainer(
    my_model,
    data_collator=data_collator,
    tokenizer=my_tokenizer
)

trainer.evaluate(data)

This is the error I get:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   2006             prediction_loss_only=True if self.compute_metrics is None else None,
   2007             ignore_keys=ignore_keys,
-> 2008             metric_key_prefix=metric_key_prefix,
   2009         )
   2010 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   2145         observed_num_examples = 0
   2146         # Main evaluation loop
-> 2147         for step, inputs in enumerate(dataloader):
   2148             # Update the observed num examples
   2149             observed_batch_size = find_batch_size(inputs)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

KeyError: 0

A glimpse of my SQuAD formatted file:

{
  "data": [
    {
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Qual è l’età?",
              "id": 78079,
              "answers": [
                {
                  "answer_id": 89658,
                  "document_id": 84480,
                  "question_id": 78079,
                  "text": "02/01/1966",
                  "answer_start": 113,
                  "answer_category": "SHORT"
                }
              ],
              "is_impossible": false
            },
            {
              "question": "Qual è il titolo di studio?",
              "id": 78082,
              "answers": [
                {
                  "answer_id": 89661,
                  "document_id": 84480,
                  "question_id": 78082,
                  "text": "media superiore",
                  "answer_start": 1157,
                  "answer_category": "SHORT"
                }
              ],
              "is_impossible": false
            },
    ...]
    "context": "..."
    "document_id": "..."

@Neuroinformatica sorry for the delay. Let me see if I can help out.
Maybe one thing to try is to declare the path to your evaluation data in the Trainer object. Something like this:

trainer = Trainer(
    my_model,
    data_collator=data_collator,
    tokenizer=my_tokenizer,
    eval_dataset=eval_data_path
)
trainer.evaluate()

I’m not sure how much of a difference that will make, but it’s worth a shot to knock out some low hanging fruit.

I could be wrong about this, but the error looks like something is trying to access the key 0 in a dictionary. My intuition is telling me that somewhere in the data collation or data loading, there needs to be a translation between the labels and their index location in a list. So if I had a list of labels ['red', 'orange', 'blue'], I would need to translate them to their index values, which would look like [0, 1, 2]. That’s the reason why I think the KeyError: 0 error is occurring. That being said, I’m not too familiar with using BERT for SQuAD.
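
One other reading of the traceback, offered as a sketch rather than a confirmed diagnosis: json.load returns a plain dict whose top-level key is "data", and the DataLoader fetches examples with self.dataset[idx] starting at index 0, so indexing that dict with 0 would raise KeyError: 0 regardless of labels. The snippet below reproduces this with a tiny made-up stand-in and flattens the nested structure into a list that can be indexed by position. The helper name flatten_squad is hypothetical, and it assumes the standard SQuAD layout with "context" stored next to "qas" in each paragraph.

```python
def flatten_squad(squad_dict):
    """Turn a SQuAD-style dict into a flat list of question/context examples."""
    examples = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph.get("context", "")
            for qa in paragraph["qas"]:
                examples.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": qa.get("answers", []),
                })
    return examples


# Tiny stand-in for the loaded JSON file (values are made up for illustration).
data = {
    "data": [
        {"paragraphs": [
            {"context": "some context text",
             "qas": [{"id": 1, "question": "q1?", "answers": []},
                     {"id": 2, "question": "q2?", "answers": []}]}
        ]}
    ]
}

try:
    data[0]  # what the DataLoader effectively does with the raw dict
except KeyError as err:
    print("KeyError:", err)

examples = flatten_squad(data)
print(len(examples), examples[0]["question"])
```

The flattened examples would still need to be tokenized into input_ids and attention_mask (plus start_positions and end_positions if a loss is wanted) before default_data_collator and the model can consume them.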

Another option to consider is to pass the list of label names to the Trainer object. To do this, you would also need to create a TrainingArguments object. Something like this:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
import torch
from transformers import default_data_collator
import json

# Model from HuggingFace
model_checkpoint = 'mrm8488/bert-italian-finedtuned-squadv1-it-alfa'

# Import tokenizer
my_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Import model
my_model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Dataset for evaluation
eval_data_path = '/content/drive/MyDrive/BERT/SQuAD_files/result.json'

with open(eval_data_path) as json_file:
  data = json.load(json_file)

data_collator = default_data_collator

my_train_args = TrainingArguments(
    output_dir='/path/to/where/you/want/the/output',
    # list_of_label_names is a placeholder: define it as your list of label names
    label_names=list_of_label_names
)

trainer = Trainer(
    my_model,
    data_collator=data_collator,
    tokenizer=my_tokenizer, 
    args=my_train_args
)

trainer.evaluate(data)

Like I said, I’m not super familiar with using BERT for QA tasks, so maybe @sgugger has some better insight.

many thanks @aclifton314, I will try this and let you know!

this worked for me! thanks
