Trainer class, compute_metrics and EvalPrediction

Hello everybody,

I am trying to use my own metric for a summarization task by passing compute_metrics to the Trainer class. I would like to calculate ROUGE-1, ROUGE-2 and ROUGE-L between the predictions of my model (a fine-tuned T5) and the labels.

However, I have trouble understanding what the Trainer passes to that function. The EvalPrediction object should be composed of predictions and label_ids. To my understanding, since I am truncating the summaries to 150 tokens, predictions and label_ids should be vectors of size 150, or of shape (batch_size, 150). Surprisingly, the predictions come as nested tuples with shapes (23, 150, 32128), (23, 12, 150, 64) or (23, 12, 512, 64). Are these logits? Meanwhile, label_ids is a tuple of 23 vectors of size 150.

Can you kindly help me to understand what I should expect from the object EvalPrediction?

Thank you.

The predictions are the outputs of your model. Without seeing your model, no one can help you figure out what they are.

Hi @sgugger. Thank you for your reply.

The model I am using is T5. This is how I initialize the model and the dataset:

from torch.utils.data import Dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments

t5 = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

class CustomDataset(Dataset):
    
    def __init__(self, data, tokenizer, input_len, summ_len, eval=False):
        self.tokenizer = tokenizer
        self.data = data
        self.input_len = input_len
        self.summ_len = summ_len
        self.eval = eval
        self.article = self.data.article
        self.summary = self.data.summary
    
    def __len__(self):
        return len(self.summary)
    
    def __getitem__(self, idx):
        item = {}
   
        article = str(self.article[idx])
        article = ' '.join(article.split())
        summary = str(self.summary[idx])
        summary = ' '.join(summary.split())

        source = self.tokenizer.batch_encode_plus(
            [article],
            max_length = self.input_len,
            truncation = True,
            padding = 'max_length',
            return_tensors = 'pt')
        target = self.tokenizer.batch_encode_plus(
            [summary],
            max_length = self.summ_len,
            truncation = True,
            padding = 'max_length',
            return_tensors='pt')
        
        item['input_ids'] = source['input_ids'].squeeze()
        item['attention_mask'] = source['attention_mask'].squeeze()
        
        y = target['input_ids'].squeeze()
        if not self.eval:
            y[y == self.tokenizer.pad_token_id] = -100
        
        item['labels'] = y
        
        return item

train_dataset = CustomDataset(
    train_dataset,
    t5_tokenizer,
    MAX_LEN,
    SUMMARY_LEN)
val_dataset = CustomDataset(
    val_dataset,
    t5_tokenizer,
    MAX_LEN,
    SUMMARY_LEN,
    eval = True)

This is how I set the TrainingArguments and the Trainer:

training_args = TrainingArguments(
    output_dir = '/content/drive/My Drive/t5_newssummary_train',
    overwrite_output_dir = True,
    do_train = True,
    evaluation_strategy = 'steps',
    eval_steps = 10,
    #prediction_loss_only = True,
    per_device_train_batch_size = TRAIN_BATCH_SIZE,
    per_device_eval_batch_size = VALID_BATCH_SIZE,
    num_train_epochs = TRAIN_EPOCHS,
    learning_rate = LEARNING_RATE,
    logging_steps = 10,
    seed = SEED,
    dataloader_num_workers = 0,
    run_name = 'doing_eval',
    logging_dir = '/content/drive/My Drive/t5_newssummary_train/logs',
    disable_tqdm = True
)
trainer = Trainer(
    model = t5,
    args = training_args,
    train_dataset = train_dataset,
    compute_metrics = my_compute_metrics,
    eval_dataset = val_dataset,
    optimizers = optimizers
)

And this is my compute_metrics function:

from transformers import EvalPrediction

def my_compute_metrics(p: EvalPrediction):
    predictions = p.predictions
    print("predictions")
    print(len(predictions))
    print_predictions(predictions)
    references = p.label_ids
    print("references")
    for r in references:
        print(r.shape)

    return {'marco': 1}

The print_predictions helper just walks the nested prediction tuples and prints the shape of every array it finds.
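Something along these lines (a minimal sketch of such a helper; it simply recurses over nested tuples and prints the array shapes):

def print_predictions(predictions):
    # Walk the (possibly nested) tuple of prediction arrays and print every shape
    for p in predictions:
        if isinstance(p, (tuple, list)):
            print("new tuple", len(p))
            print_predictions(p)
        else:
            print(p.shape)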

The output I get when evaluating is

predictions
3
(23, 150, 32128)
new tuple 12
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
new tuple 4
(23, 12, 150, 64)
(23, 12, 150, 64)
(23, 12, 512, 64)
(23, 12, 512, 64)
(23, 512, 768)
references
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
(150,)
{'eval_loss': 7.113981246948242, 'eval_marco': 1, 'epoch': 0.18518518518518517}

I made my compute_metrics function print these shapes because I was getting unexpected predictions and labels. As I said in my initial post, I was expecting 2 predictions and 2 labels (because my eval batch size is 2). Instead, I get 23 labels and some strange nested tuples for the predictions.

Thank you very much for your answer again, hope you can help me sort this out.

I don't know which head of the T5 model you are using since you didn't show how the t5 object was created, so I can't point you to its documentation to look together at the outputs it returns.

@sgugger thank you. I edited my reply to be complete.

I am very sorry for wasting your time.

Hi @marcoabrate

The T5ForConditionalGeneration model returns a tuple which contains ['logits', 'past_key_values', 'encoder_last_hidden_state'].
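Mapping that onto the shapes you printed (a rough guide, assuming the default t5-base config: 12 layers, 12 attention heads, head dimension 64, vocabulary size 32128, and 23 examples in your validation set):

logits = p.predictions[0]            # (23, 150, 32128): per example, per target position, scores over the vocabulary
past_key_values = p.predictions[1]   # 12 layers, each a 4-tuple of (batch, heads, seq_len, head_dim) key/value tensors
encoder_hidden = p.predictions[2]    # (23, 512, 768): encoder_last_hidden_state for the 512-token inputs

The 23 is the size of your whole validation set rather than the batch size: the Trainer concatenates the outputs of all evaluation batches before calling compute_metrics.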

To calculate generative metrics we need to actually generate sequences during evaluation; we can't compute these metrics from the logits alone.

The examples/seq2seq directory here supports seq2seq training (summarization, translation) and also computes the appropriate metrics (ROUGE, BLEU, etc.).

For seq2seq training, consider using Seq2SeqTrainer with finetune_trainer.py (which uses Trainer) or finetune.py (which uses pytorch-lightning).
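If you go the Seq2SeqTrainer route, the setup looks roughly like this. This is only a sketch: it assumes your installed transformers version ships Seq2SeqTrainer and Seq2SeqTrainingArguments with predict_with_generate, uses datasets.load_metric('rouge') for scoring, and the function name rouge_compute_metrics is just an example; adapt the names to your own code:

import numpy as np
from datasets import load_metric
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

rouge = load_metric('rouge')

def rouge_compute_metrics(p):
    # With predict_with_generate=True the predictions are generated token ids, not logits
    preds = np.where(p.predictions != -100, p.predictions, t5_tokenizer.pad_token_id)
    labels = np.where(p.label_ids != -100, p.label_ids, t5_tokenizer.pad_token_id)
    decoded_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    # Keep only the mid f-measure of each ROUGE variant
    return {name: score.mid.fmeasure for name, score in result.items()}

seq2seq_args = Seq2SeqTrainingArguments(
    output_dir = '/content/drive/My Drive/t5_newssummary_train',
    predict_with_generate = True,   # generate sequences during evaluation instead of returning logits
    evaluation_strategy = 'steps',
    eval_steps = 10,
    per_device_train_batch_size = TRAIN_BATCH_SIZE,
    per_device_eval_batch_size = VALID_BATCH_SIZE,
)

trainer = Seq2SeqTrainer(
    model = t5,
    args = seq2seq_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = rouge_compute_metrics,
)

With predict_with_generate enabled, p.predictions and p.label_ids are both token-id arrays of the generated and reference summaries, so ROUGE can be computed on the decoded strings.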

Perfect. Thank you for all the information!