What does EvalPrediction.predictions contain exactly?

I want to implement a function for computing metrics and pass it to the Trainer. In the docs, EvalPrediction has two attributes: predictions and label_ids. It is written that both of them are of type ndarray, but this is not the case for me.

The label_ids attribute is correct: it is an ndarray of shape (4, seqlen), where 4 is the number of samples in my validation set. However, the predictions attribute is a tuple?

At index 0, I have an array of shape (3, 4, 56, 32104). 4 is again the number of samples, 56 is the sequence length, and 32104 is the vocabulary size, but what is the 3 then?

At index 1, I first have a tuple/list of tuples of size 4, 6, and then an array of shape (4, 8, 56, 64).

And at index 2, I have an array of shape (4, 78, 512).

What are all these arrays actually? I think this should be clarified in the documentation.

Thanks for your help!

The Trainer will put in predictions everything your model returns (apart from the loss). So if you get multiple arrays, it’s likely because your model returns multiple things. No one can help you determine what they are without seeing your model (which is why you should always post the code you’re using when asking for help :wink: )
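
If you just want to see what you are getting, a quick diagnostic compute_metrics that prints the nested structure can help you match each array back to a model output. This is only a sketch; the function name is made up:

import numpy as np

def inspect_predictions(eval_pred):
    # Recursively print types and shapes of whatever the model returned,
    # so each array can be matched back to a model output.
    def describe(obj, indent=0):
        pad = " " * indent
        if isinstance(obj, np.ndarray):
            print(f"{pad}ndarray {obj.shape}")
        elif isinstance(obj, (tuple, list)):
            print(f"{pad}{type(obj).__name__} of length {len(obj)}")
            for item in obj:
                describe(item, indent + 2)
        else:
            print(f"{pad}{type(obj).__name__}")

    print("predictions:")
    describe(eval_pred.predictions, indent=2)
    print("label_ids:")
    describe(eval_pred.label_ids, indent=2)
    return {}  # no real metrics yet, just inspecting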

Okay, I understand, but I am just using the T5 model from the library, so it is not my own model. I can post the code anyway.

import torch
import argparse
import os
import sys
import numpy as np
import torch.nn.functional as F
sys.path.append('..')
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments
from data_reader import GetDataAsPython
from sklearn.model_selection import train_test_split
from prepare_data import create_data, create_dataset
from transformers import T5Tokenizer

parser = argparse.ArgumentParser()
parser.add_argument('-e', '--epochs', type=int, default=100)
parser.add_argument('-bs', '--batch-size', type=int, default=1)
parser.add_argument('-lr', '--learning-rate', type=float, default=1e-4)
parser.add_argument('-gcv', '--gradient-clip-val', type=float, default=0.0)
parser.add_argument('-wd', '--weight-decay', type=float, default=0.01)
args = parser.parse_args()

# delete the previous logs and results directories
model_name = "t5"
os.system("rm -rf ./logs" + model_name)
os.system("rm -rf ./results_" + model_name)

data = GetDataAsPython('../data_large.json')

train_inputs, train_labels, val_inputs, val_labels, test_inputs, test_labels = create_data(data, ['no-array-constructor'])

# from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
print('len of tokenizer before adding: ', len(tokenizer))
tokenizer.add_tokens(['{', '}', '<', '>'])
train_dataset = create_dataset(train_inputs, train_labels, tokenizer, True)
val_dataset = create_dataset(val_inputs, val_labels, tokenizer, False)
test_dataset = create_dataset(test_inputs, test_labels, tokenizer, False)


def compute_val_metrics(eval_predictions):
    # print('\n')
    # print(len(eval_predictions.predictions[1]))
    # print(len(eval_predictions.predictions[1][0]))
    # print(eval_predictions.predictions[1][0][0].shape)
    metrics = {}  # placeholder until the contents of predictions are clear

    return metrics

training_args = TrainingArguments(
    output_dir='./results_' + model_name,          
    num_train_epochs=args.epochs,              
    per_device_train_batch_size=args.batch_size,  
    per_device_eval_batch_size=4,   
    warmup_steps=500,                
    weight_decay=args.weight_decay,               
    logging_dir='./logs_' + model_name,
    logging_steps=10,
    do_eval=True,
    evaluation_strategy='epoch',
    learning_rate=args.learning_rate,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    # prediction_loss_only=True
)

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
model.resize_token_embeddings(len(tokenizer))
# model.resize maybe depending on tokens

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    optimizers=[torch.optim.Adam(params=model.parameters(), lr=args.learning_rate), None],       
    tokenizer=tokenizer,
    compute_metrics=compute_val_metrics
)

trainer.train()

The 3 is ['logits', 'past_key_values', 'encoder_last_hidden_state']. This is what the seq2seq models return. logits is what you’ll need for computing metrics; it has shape (bs, seq_len, vocab_size).
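
So a compute_metrics function only needs the first element of that tuple. A minimal sketch, assuming padded label positions are set to -100 (the library’s usual convention):

import numpy as np

def compute_val_metrics(eval_predictions):
    # predictions is a tuple; per the output order above, index 0 holds the
    # logits of shape (bs, seq_len, vocab_size)
    logits = eval_predictions.predictions[0]
    label_ids = eval_predictions.label_ids

    preds = np.argmax(logits, axis=-1)   # greedy token predictions
    mask = label_ids != -100             # ignore padded label positions
    accuracy = (preds == label_ids)[mask].mean()

    return {"token_accuracy": float(accuracy)}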

Also, for training seq2seq models consider using Seq2SeqTrainer, which supports generation during evaluation so you can calculate generative metrics like BLEU, ROUGE, etc.

Check finetune_trainer.py to see how to use Seq2SeqTrainer
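
For reference, here is a rough sketch of what that setup could look like. The Seq2SeqTrainingArguments/predict_with_generate names follow current versions of the library, the exact-match metric is just an example (BLEU/ROUGE would slot in the same way), and model/tokenizer/datasets are reused from the script above:

import numpy as np
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# With predict_with_generate=True, compute_metrics receives generated token ids
# rather than raw logits, so you can decode and compare strings directly.
def compute_generative_metrics(eval_predictions):
    pred_ids = eval_predictions.predictions
    label_ids = np.where(
        eval_predictions.label_ids != -100,
        eval_predictions.label_ids,
        tokenizer.pad_token_id,
    )
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    exact = np.mean([p.strip() == l.strip() for p, l in zip(pred_str, label_str)])
    return {"exact_match": float(exact)}

seq2seq_args = Seq2SeqTrainingArguments(
    output_dir='./results_' + model_name,
    per_device_eval_batch_size=4,
    evaluation_strategy='epoch',
    predict_with_generate=True,   # run model.generate() during evaluation
)

seq2seq_trainer = Seq2SeqTrainer(
    model=model,
    args=seq2seq_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_generative_metrics,
)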


Thank you very much! Now it is clear. I think this information should also be added to the documentation. I will also take a look at Seq2SeqTrainer.

Thanks!

I’m not sure why you think the information is not in the documentation. In the T5 model docs, under the returns section, I can see everything. Where do you think it’s missing?


Yes, you are right. I did not pay attention to that. However, models return a lot of things. Do you think it would be easier to return a dictionary, in the format {“encoder_outputs”: Tensor, “logits”: Tensor}?

The models themselves already return that when you pass return_dict=True (which will soon become the default). We can definitely add some code to carry that type of output over to the predictions when the option is selected.
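
For anyone reading along, this is what return_dict=True gives you on a plain forward pass: a named output object instead of a positional tuple. A small illustration, reusing the tokenizer and model from the script above with a toy input/target pair:

import torch

enc = tokenizer("translate English to German: Hello world", return_tensors="pt")
labels = tokenizer("Hallo Welt", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(**enc, labels=labels, return_dict=True)

print(type(outputs).__name__)                   # Seq2SeqLMOutput
print(list(outputs.keys()))                     # named fields: loss, logits, ...
print(outputs.logits.shape)                     # (batch, target_seq_len, vocab_size)
print(outputs.encoder_last_hidden_state.shape)  # (batch, source_seq_len, d_model)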


We can definitely add some code to carry that type of output over to the predictions when the option is selected.

Am I correct to understand this comment as: “it would be useful for predictions to be a dict when the forward pass returns one with return_dict=True”?

The code that builds the logits which are ultimately fed into the EvalPrediction.predictions attribute is the following:

if isinstance(outputs, dict):
    logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"])

So whatever order dict.items() returns determines the order of the components in EvalPrediction.predictions.
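
As a side note, you can already control which dict entries survive this filtering via the ignore_keys argument of evaluate/predict, e.g. to keep only the logits (a sketch, assuming a version of the library that exposes this argument):

# Drop everything except the logits before predictions are assembled.
metrics = trainer.evaluate(
    eval_dataset=val_dataset,
    ignore_keys=["past_key_values", "encoder_last_hidden_state"],
)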

However, relying on order like this increases the cognitive load required to work with the outputs. Additionally, it’s more error prone and can cause all sorts of problems without prior warning, especially if the output is created using a Mapping type that does not provide reliable ordering.

Setting EvalPrediction.predictions to be exactly what the user provided as the output of the forward pass could simplify things considerably.

Would that be a correct assessment?