Warning when adding compute_metrics function to Trainer

When I add a custom compute_metrics function to the Trainer, I get the warning “Not all data has been set. Are you sure you passed all values?” at each evaluation step.
This warning comes from the finalize method of the class trainer_pt_utils.DistributedTensorGatherer:

if self._offsets[0] != self.process_length:
    logger.warn("Not all data has been set. Are you sure you passed all values?")

This is my compute_metrics function:

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)

    accuracy = round(accuracy_score(labels, preds), 3)
    micro_f1 = round(f1_score(labels, preds, average="micro"), 3)
    macro_f1 = round(f1_score(labels, preds, average="macro"), 3)

    return {"Accuracy": accuracy, "Micro F1": micro_f1, "Macro F1": macro_f1}

The additional metrics are successfully returned. So what does this warning mean? Any help would be much appreciated. :hugs:


Could you share the code/script you use so we can reproduce on our side? It seems to indicate that not all the predictions were used for the metric computation (or it could just be a bug).

Thanks! Indeed this is the case. The last values of preds, which is passed to the compute_metrics function, were all equal to -100, the padding index.
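For anyone running into the same thing, this is roughly how I spotted the padded rows inside compute_metrics (just a debugging sketch; the only specific value here is the -100 padding index mentioned above):

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Rows consisting entirely of -100 were never filled with real predictions.
    padded_rows = np.all(preds == -100, axis=1)
    print(f"{padded_rows.sum()} of {len(preds)} prediction rows are padding")
    preds = np.argmax(preds, axis=1)
    return {"Accuracy": accuracy_score(labels, preds)}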

Here is a reduced version of my setup that produces arrays with padded values only:

Code

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

class CustomDataset(Dataset):

    def __init__(self):
        self.input = [[0., 0., 0., 1.], [1., 0., 0., 1.], [1., 1., 1., 0.], [0., 0., 0., 1.],
                      [1., 0., 0., 1.], [1., 1., 1., 0.], [1., 0., 0., 1.], [1., 1., 1., 0.]]
        self.labels = [1, 1, 0, 1, 1, 0, 1, 1]
        self.n_tokens = 4
        self.n_labels = 2

    def __len__(self):
        return len(self.input)

    def __getitem__(self, idx):
        input_dict = {"inputs": torch.tensor(self.input[idx]),
                      "label_ids": torch.tensor(self.labels[idx])}
        return input_dict

class MyTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs["labels"].long()
        logits = model(inputs["inputs"])
        loss_function = nn.CrossEntropyLoss()
        loss = loss_function(logits, labels)
        return (loss, logits) if return_outputs else loss

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    print(preds)
    preds = np.argmax(preds, axis=1)

    accuracy = accuracy_score(labels, preds)
    micro_f1 = f1_score(labels, preds, average="micro")
    macro_f1 = f1_score(labels, preds, average="macro")

    return {"Accuracy": accuracy, "Micro F1": micro_f1, "Macro F1": macro_f1}

dataset = CustomDataset()

n_tokens = dataset.n_tokens
n_hidden = 2
n_labels = dataset.n_labels

model = nn.Sequential(
    nn.Linear(n_tokens, n_hidden),
    nn.ReLU(),
    nn.BatchNorm1d(n_hidden),
    nn.Linear(n_hidden, n_hidden),
    nn.ReLU(),
    nn.BatchNorm1d(n_hidden),
    nn.Linear(n_hidden, n_labels))

train_dataset = eval_dataset = dataset

args = TrainingArguments(output_dir="example",
                         report_to=[],
                         num_train_epochs=3,
                         per_device_train_batch_size=2,
                         per_device_eval_batch_size=1,
                         evaluation_strategy="steps",
                         logging_steps=4)

trainer = MyTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Also, I found that the shape of preds changes for different values of per_device_eval_batch_size. If I set the eval batch size to 2 in this setup, my compute_metrics function fails at argmax(axis=1) because the array is one-dimensional.

Generally, preds, as it is used here,

preds = preds_gatherer.finalize() if not prediction_loss_only else None
metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))

should simply be the (concatenated) model output; did I understand this correctly?

Hi Sylvain,
the code is in the reply above; I forgot to reply to your comment directly.

Some additional information:
In my full setup I am running this model on different datasets with different numbers of classes. The share of empty rows in the preds array ranges from 30% to 90%. And here is the interesting relationship I found: the number of empty rows divided by the number of classes was always 188. Not sure yet what this means…

I can reproduce the warning. Will investigate more when I have a bit of time.

Thank you! :hugs:
The Trainer would be very helpful for this part of my research pipeline, not least because of the convenient hyperparameter search integration.


So I changed the toy dataset (which serves as both train and eval dataset) to 16 examples and set the train and eval batch sizes to their default of 8.

Code

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

class CustomDataset(Dataset):

    def __init__(self):
        self.input = [[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.],
                      [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.],
                      [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.],
                      [1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]]
        self.labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
        self.n_tokens = 4
        self.n_labels = 2

    def __len__(self):
        return len(self.input)

    def __getitem__(self, idx):
        input_dict = {"inputs": torch.tensor(self.input[idx]),
                      "label_ids": torch.tensor(self.labels[idx])}
        return input_dict

class MyTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs["labels"].long()
        logits = model(inputs["inputs"])
        loss_function = nn.CrossEntropyLoss()
        loss = loss_function(logits, labels)
        return (loss, logits) if return_outputs else loss

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    print(preds)
    preds = np.argmax(preds, axis=1)

    accuracy = accuracy_score(labels, preds)
    micro_f1 = f1_score(labels, preds, average="micro")
    macro_f1 = f1_score(labels, preds, average="macro")

    return {"Accuracy": accuracy, "Micro F1": micro_f1, "Macro F1": macro_f1}

dataset = CustomDataset()

n_tokens = dataset.n_tokens
n_hidden = 2
n_labels = dataset.n_labels

model = nn.Sequential(
    nn.Linear(n_tokens, n_hidden),
    nn.ReLU(),
    nn.BatchNorm1d(n_hidden),
    nn.Linear(n_hidden, n_hidden),
    nn.ReLU(),
    nn.BatchNorm1d(n_hidden),
    nn.Linear(n_hidden, n_labels))

train_dataset = eval_dataset = dataset

args = TrainingArguments(output_dir="example",
                         report_to=[],
                         num_train_epochs=3,
                         per_device_train_batch_size=8,
                         per_device_eval_batch_size=8,
                         evaluation_strategy="steps",
                         logging_steps=4)

trainer = MyTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Here is what happens when comparing the output of trainer.predict and trainer.model:

preds1 = trainer.model(dataset[0:16]["inputs"]).detach().numpy()
preds2 = trainer.predict(dataset)[0]

preds2 == np.concatenate((preds1[1:8], preds1[9:16], [[-100., -100.], [-100., -100.]]))
Output:
array([[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True],
[ True, True]])

So what got lost here is the first instance of each of the two batches.

I changed it to a 3-class dataset and the same two rows were missing, so in this small example there is no sign of the relationship between the number of missing values and the number of classes that I saw in the full setup.

Edit:
I had to run the line preds1 = trainer.model(dataset[0:16]["inputs"]).detach().numpy() twice to get these results. The first call of trainer.model gives different results from subsequent ones.
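My guess for that difference: the model contains BatchNorm1d layers, and the first manual forward pass presumably still runs in training mode (using and updating the batch statistics), while the later calls happen after trainer.predict has switched the model to eval mode. If that is indeed the cause, putting the model into eval mode before the manual forward pass should give stable, comparable outputs (sketch):

# Assumption: the mismatch comes from BatchNorm train vs. eval mode.
trainer.model.eval()
with torch.no_grad():
    preds1 = trainer.model(dataset[0:16]["inputs"]).detach().numpy()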

I figured out your problem: the Trainer is optimized for Transformers models, so it expects the model outputs to always be a tuple, with the loss first and the logits second if labels are provided, or just the logits if no labels are provided.
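Concretely, if I read the prediction_step code right, the Trainer takes whatever your compute_loss returns as outputs and strips the loss off with a [1:] slice. On a tuple that leaves the logits untouched, but on a bare logits tensor the same slice removes the first example of each batch, which matches the missing rows you saw. A small illustration of just the slicing behaviour (not the actual Trainer code):

import torch

logits = torch.randn(8, 2)              # bare logits for a batch of 8 examples

# With the expected tuple, slicing off the loss leaves the logits whole.
outputs_as_tuple = (torch.tensor(0.3), logits)
print(outputs_as_tuple[1:][0].shape)    # torch.Size([8, 2])

# With a bare tensor, the slice drops the first example of the batch instead.
print(logits[1:].shape)                 # torch.Size([7, 2])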

Therefore, in the custom compute_loss of your Trainer subclass, the last line should be:

    return (loss, (loss, logits)) if return_outputs else loss
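Applied to the compute_loss from your script, the whole method would then read like this (only the return line changed):

class MyTrainer(Trainer):

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs["labels"].long()
        logits = model(inputs["inputs"])
        loss_function = nn.CrossEntropyLoss()
        loss = loss_function(logits, labels)
        # The Trainer expects the outputs to be a tuple with the loss first
        # and the logits second, so wrap them accordingly.
        return (loss, (loss, logits)) if return_outputs else loss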

Thanks so much!!!

May I suggest adding this information to the Trainer documentation, either in the Trainer subclassing example at the beginning or in the compute_loss section?

Sure! If you want to make a PR with this info, that would be great! For now there is just a rather vague note in the Trainer docstring, so anything more useful is more than welcome.