Different results predicting from trainer and model

Hi, I’m training a simple classification model and I’m seeing some unexpected behaviour:
When the training ends, I predict with the model loaded at the end with:
predictions = trainer.predict(tokenized_test_dataset)
list(np.argmax(predictions.predictions, axis=-1))

and I obtain predictions which match the accuracy obtained during the training

(the model loaded at the end of the training is the best of the training, I’m using load_best_model_at_end=True).

However, if I load the model from the checkpoint (the best one), and get predictions with:
logits = model(model_inputs)
probabilities = torch.nn.functional.softmax(logits.logits, dim=-1)
predictions = torch.argmax(probabilities, axis=1)

I get predictions which are slightly different from the previous ones and do not match the accuracy of the training.

So, is there anything I’m missing? Shouldn’t these predictions be exactly equal? Any help would be appreciated!
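As a side note on the two snippets above: one takes the argmax of the raw logits and the other applies a softmax first. That step alone should not explain a difference, because softmax preserves the ordering of the logits. A minimal sketch with made-up logits (the `softmax` helper and the random values are assumptions for illustration, not part of the original code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Made-up logits for 4 examples and 6 classes (illustrative values only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))

from_logits = np.argmax(logits, axis=-1)
from_probs = np.argmax(softmax(logits), axis=-1)

# Softmax is strictly increasing in each logit, so the ranking is the same.
same = bool(np.array_equal(from_logits, from_probs))
print(same)
```

So if the two prediction paths disagree, the cause is more likely to be in what is fed to the model than in the post-processing.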

It’s hard to know where the problem lies without seeing the whole code. It could be that your model_inputs are defined differently than in the tokenized_test_dataset for instance.

Hello,

I am seeing the same phenomenon. When I load the best model only to run the test, the model outputs the same class for every example. But if I train and test directly, I get good results. If I only test with the predict() function, I get predictions of a single class. I do not understand why.

best_model_from_training_testing = './checkpoint-900'
best_model= FlaubertForSequenceClassification.from_pretrained(best_model_from_training_testing, num_labels=3)

trainer = Trainer(best_model)
raw_pred, _, _ = trainer.predict(test_tokenized_dataset)         
predictedLabelOnCompanyData = np.argmax(raw_pred, axis=1)
1st run : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
2nd run : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
3rd run : [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]  # another class

I don’t know if there’s a cleaner way to share my code, but here it is.

For the training:

training_name = 'training 2'

import datasets
from datasets import load_dataset
import transformers

train_dataset = load_dataset('csv', data_files=f'trainings/{training_name}/train_dataset.csv')
test_dataset = load_dataset('csv', data_files=f'trainings/{training_name}/test_dataset.csv')

print(f"Longitud train dataset {len(train_dataset['train'])}")
print(f"Longitud test dataset {len(test_dataset['train'])}")

model_name = 'bert-base-cased'

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    result =  tokenizer(examples["input"], padding="max_length", truncation=True)
    result['label'] = examples['patent_type_id']
    return result

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Drop the raw text and label columns that the model does not consume
columns_to_remove = ['input', 'patent_type_id', 'patent_type', 'patent']

tokenized_train_dataset = tokenized_train_dataset.remove_columns(columns_to_remove)
tokenized_test_dataset = tokenized_test_dataset.remove_columns(columns_to_remove)

tokenized_test_dataset['train']


tokenized_train_dataset = tokenized_train_dataset["train"]
tokenized_test_dataset = tokenized_test_dataset["train"]

import pandas as pd

pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header=0).groupby(['patent_type'])['patent'].describe()[['count']]

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)


from transformers import TrainingArguments

batch_size = 32
training_args = TrainingArguments(f"classification_patent_types_{training_name}",
                                  evaluation_strategy="epoch",
                                  save_strategy = "epoch",
                                  save_total_limit=3,
                                  load_best_model_at_end=True,
                                  logging_strategy = "epoch",
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  num_train_epochs=30,
                                  seed=1234
                                  )

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Predicting with model

predictions = trainer.predict(tokenized_test_dataset)
list(np.argmax(predictions.predictions, axis=-1))

import pandas as pd

test_dataset = pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header = 0)

test_dataset['predictions'] = list(np.argmax(predictions.predictions, axis=-1))
test_dataset['correct'] = test_dataset['patent_type_id'] == test_dataset['predictions']
test_dataset

corrects = sum(test_dataset['correct'])
total = len(test_dataset)
fraction = corrects/total
print(f"{corrects} corrects out of {total}, which makes an accuracy of: {fraction}")

And for the predictions:


training_name = 'training 2'


label_dict = {
    0:'fdf',
    1:'markush',
    2:'polymorph',
    3:'process',
    4:'psd',
    5:'use'}
label_dict


import pandas as pd

pd.read_csv(f'trainings/{training_name}/train_dataset.csv', header=0)[['patent_type_id','patent_type']].groupby(['patent_type_id']).agg('first')

import datasets
from datasets import load_dataset
import transformers

import pandas as pd

test_dataset = pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header = 0)
test_dataset_list = list(test_dataset['input'])

from transformers import AutoTokenizer

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

import torch

encoded_test_dataset = tokenizer(test_dataset_list, padding="max_length", truncation = True)
model_inputs = torch.tensor(encoded_test_dataset['input_ids'])

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(f'trainings/{training_name}/classification_patent_types_{training_name}/checkpoint-78')

import numpy as np

logits = model.forward(model_inputs)

probabilities = torch.nn.functional.softmax(logits.logits, dim=-1)
predictions = torch.argmax(probabilities, axis=1)
prediction_labels = [label_dict[int(el)] for el in predictions]


predictions

test_dataset['predictions'] = prediction_labels
test_dataset['correct'] = test_dataset['patent_type'] == test_dataset['predictions']
test_dataset

corrects = sum(test_dataset['correct'])
total = len(test_dataset)
fraction = corrects/total
print(f"{corrects} corrects out of {total}, which makes an accuracy of: {fraction}")

I have manually checked, and model_inputs and tokenized_test_dataset['input_ids'] appear to be equal, so the error doesn’t seem to be there…
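Comparing only the input IDs can miss differences in the other features a tokenizer returns. One way to check this is to compare every key of the two inputs rather than a single one. The sketch below assumes both sides have been converted to plain dicts of lists; `diff_model_inputs` and the sample values are hypothetical names and data chosen for illustration:

```python
def diff_model_inputs(reference, candidate):
    """Describe how two tokenizer-style outputs differ, key by key.

    Both arguments are assumed to map a feature name to a list of
    per-example lists, e.g. {"input_ids": [[101, 7592, 102, 0]]}.
    """
    report = {}
    for key in sorted(set(reference) | set(candidate)):
        if key not in candidate:
            report[key] = "missing from candidate"
        elif key not in reference:
            report[key] = "missing from reference"
        elif list(reference[key]) != list(candidate[key]):
            report[key] = "values differ"
    return report

# Illustrative values: the second dict mimics passing only the input IDs.
full = {
    "input_ids": [[101, 7592, 102, 0, 0]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
ids_only = {"input_ids": [[101, 7592, 102, 0, 0]]}

report = diff_model_inputs(full, ids_only)
print(report)  # {'attention_mask': 'missing from candidate'}
```

An empty report means the two inputs carry the same features with the same values.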

Ok, I seem to have identified the source of the error.
When tokenizing the data for the predictions, if I use the argument return_tensors="pt" and then predict with logits = model(**model_inputs) instead of logits = model(model_inputs), I obtain results that are equal to the ones obtained with the trainer. To clarify:

Old code (leads to incorrect behaviour):

encoded_test_dataset = tokenizer(test_dataset_list, padding="max_length", truncation = True)
model_inputs = torch.tensor(encoded_test_dataset['input_ids'])
logits = model(model_inputs)

New code (leads to correct behaviour):

model_inputs = tokenizer(test_dataset_list, padding="max_length", truncation = True, return_tensors="pt")
logits = model(**model_inputs)

Could you please shed some light on why the behaviour differs?
Thanks a lot

You are not passing the attention mask when doing logits = model(model_inputs) in the first sample. If you look at model_inputs in the second sample, you will see it has several keys, in particular the input IDs and the attention mask. The attention mask tells your model which tokens to ignore because they are padding. Without it, the model attends to the padding tokens as well and, as you observed, that yields different results.
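To see why padding matters without a mask, here is a toy single-head self-attention in NumPy. It is only a sketch of the mechanism, not BERT's actual implementation: the `attention` function, the identity projections, and the random vectors are all assumptions made for illustration.

```python
import numpy as np

def attention(x, mask=None):
    """Toy single-head self-attention with identity projections.

    x: (seq_len, dim) token vectors.
    mask: (seq_len,) with 1 for real tokens and 0 for padding.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if mask is not None:
        # Padded key positions get a very negative score, so softmax
        # assigns them (numerically) zero weight.
        scores = np.where(mask[None, :] == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
real = rng.normal(size=(3, 4))   # three real tokens
pad = np.full((2, 4), 0.5)       # two padding vectors (arbitrary embedding)
padded = np.vstack([real, pad])
mask = np.array([1, 1, 1, 0, 0])

unpadded_out = attention(real)
masked_out = attention(padded, mask)[:3]
unmasked_out = attention(padded)[:3]

matches_with_mask = bool(np.allclose(unpadded_out, masked_out))
matches_without_mask = bool(np.allclose(unpadded_out, unmasked_out))
print(matches_with_mask, matches_without_mask)
```

With the mask, the real tokens get the same output as in the unpadded sequence. Without it, the padding vectors leak into every token's representation, which is enough to flip some predictions.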

This is also explained in depth in the course.

Hello, I tried your solution, but I still get a difference between the predictions after training and the predictions at inference, using the same test file. I do not know what to do.
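When a difference like this persists, a first step could be to quantify it and find which examples disagree. The helper below is a hypothetical sketch (`compare_predictions` and the sample arrays are made up for illustration); it only assumes two equally long sequences of predicted class IDs:

```python
import numpy as np

def compare_predictions(preds_a, preds_b):
    """Summarise how two arrays of predicted class IDs differ."""
    a = np.asarray(preds_a)
    b = np.asarray(preds_b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    disagree = np.flatnonzero(a != b)
    return {
        "agreement": float(np.mean(a == b)),
        "disagreeing_indices": disagree.tolist(),
    }

# Made-up predictions for six examples (illustrative values only).
from_trainer = [0, 1, 2, 3, 4, 5]
from_model = [0, 1, 2, 3, 1, 5]

summary = compare_predictions(from_trainer, from_model)
print(summary)
```

The disagreeing indices can then be looked up in the test file, for example to check whether those inputs are heavily padded or truncated.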