Different results predicting from trainer and model

ArnauC · December 17, 2021, 11:28am

Hi, I’m training a simple classification model and I’m seeing unexpected behaviour.
When training ends, I predict with the model loaded at the end:
predictions = trainer.predict(tokenized_test_dataset)
list(np.argmax(predictions.predictions, axis=-1))

and I obtain predictions which match the accuracy obtained during the training

(the model loaded at the end of the training is the best of the training, I’m using load_best_model_at_end=True).

However, if I load the model from the checkpoint (the best one) and get predictions with:
logits = model(model_inputs)
probabilities = torch.nn.functional.softmax(logits.logits, dim=-1)
predictions = torch.argmax(probabilities, axis=1)

I get predictions which are slightly different from the previous ones and do not match the accuracy of the training.

So, is there anything I’m missing? Shouldn’t these predictions be exactly equal? Any help would be appreciated!

sgugger · December 17, 2021, 1:00pm

It’s hard to know where the problem lies without seeing the whole code. It could be that your model_inputs are defined differently than in the tokenized_test_dataset for instance.
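One way to check this is to compare every field of the two inputs, not only the token IDs. The sketch below is a minimal, framework-free illustration: diff_encodings is a hypothetical helper (it is not part of transformers or datasets), and the two dictionaries are made-up toy data standing in for what the Trainer receives and what is passed to the model by hand.

```python
def diff_encodings(a, b):
    """Return a list of human-readable differences between two tokenizer-style outputs.

    Both arguments are plain dicts mapping a field name (e.g. "input_ids")
    to a list of per-example lists. Hypothetical helper for illustration only.
    """
    problems = []
    for key in sorted(set(a) | set(b)):
        if key not in a:
            problems.append(f"'{key}' is missing from the first input")
        elif key not in b:
            problems.append(f"'{key}' is missing from the second input")
        elif a[key] != b[key]:
            problems.append(f"'{key}' has different values")
    return problems


# Toy data (assumed values): what the Trainer sees vs. what is built by hand.
trainer_side = {
    "input_ids": [[101, 7592, 102, 0, 0]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
manual_side = {"input_ids": [[101, 7592, 102, 0, 0]]}

for line in diff_encodings(trainer_side, manual_side):
    print(line)
```

Identical input_ids are not enough on their own: a field that exists on one side only is reported as well.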

emmakelo · December 17, 2021, 2:52pm

Hello,

I see the same phenomenon. When I load the best model only to run the test, the model outputs the same class for every example, but if I train and test directly I get good results. If I just test with the predict() function, I get predictions of a single class. I do not understand why.

best_model_from_training_testing = './checkpoint-900'
best_model= FlaubertForSequenceClassification.from_pretrained(best_model_from_training_testing, num_labels=3)

trainer = Trainer(best_model)
raw_pred, _, _ = trainer.predict(test_tokenized_dataset)         
predictedLabelOnCompanyData = np.argmax(raw_pred, axis=1)

1st run : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
2nd run : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
3rd run : [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]  # another class

ArnauC · December 20, 2021, 9:59am

I don’t know if there’s a cleaner way to share my code, but here it comes

For the training:

training_name = 'training 2'

import datasets
from datasets import load_dataset
import transformers

train_dataset = load_dataset('csv', data_files=f'trainings/{training_name}/train_dataset.csv')
test_dataset = load_dataset('csv', data_files=f'trainings/{training_name}/test_dataset.csv')

print(f"Longitud train dataset {len(train_dataset['train'])}")
print(f"Longitud test dataset {len(test_dataset['train'])}")

model_name = 'bert-base-cased'

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    result =  tokenizer(examples["input"], padding="max_length", truncation=True)
    result['label'] = examples['patent_type_id']
    return result

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Drop the raw text and label columns that the model does not take as input
columns_to_drop = ['input', 'patent_type_id', 'patent_type', 'patent']
tokenized_train_dataset = tokenized_train_dataset.remove_columns(columns_to_drop)
tokenized_test_dataset = tokenized_test_dataset.remove_columns(columns_to_drop)

tokenized_test_dataset['train']


tokenized_train_dataset = tokenized_train_dataset["train"]
tokenized_test_dataset = tokenized_test_dataset["train"]

import pandas as pd

pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header=0).groupby(['patent_type'])['patent'].describe()[['count']]

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)


from transformers import TrainingArguments

batch_size = 32
training_args = TrainingArguments(f"classification_patent_types_{training_name}",
                                  evaluation_strategy="epoch",
                                  save_strategy = "epoch",
                                  save_total_limit=3,
                                  load_best_model_at_end=True,
                                  logging_strategy = "epoch",
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  num_train_epochs=30,
                                  seed=1234
                                  )

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

# Predicting with model

predictions = trainer.predict(tokenized_test_dataset)
list(np.argmax(predictions.predictions, axis=-1))

import pandas as pd

test_dataset = pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header = 0)

test_dataset['predictions'] = list(np.argmax(predictions.predictions, axis=-1))
test_dataset['correct'] = test_dataset['patent_type_id'] == test_dataset['predictions']
test_dataset

corrects = sum(test_dataset['correct'])
total = len(test_dataset)
fraction = corrects/total
print(f"{corrects} corrects out of {total}, which makes an accuracy of: {fraction}")

And for the predictions:


training_name = 'training 2'


label_dict = {
    0:'fdf',
    1:'markush',
    2:'polymorph',
    3:'process',
    4:'psd',
    5:'use'}
label_dict


import pandas as pd

pd.read_csv(f'trainings/{training_name}/train_dataset.csv', header=0)[['patent_type_id','patent_type']].groupby(['patent_type_id']).agg('first')

import datasets
from datasets import load_dataset
import transformers

import pandas as pd

test_dataset = pd.read_csv(f'trainings/{training_name}/test_dataset.csv', header = 0)
test_dataset_list = list(test_dataset['input'])

from transformers import AutoTokenizer

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

import torch

encoded_test_dataset = tokenizer(test_dataset_list, padding="max_length", truncation = True)
model_inputs = torch.tensor(encoded_test_dataset['input_ids'])

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(f'trainings/{training_name}/classification_patent_types_{training_name}/checkpoint-78')

import numpy as np

logits = model.forward(model_inputs)

probabilities = torch.nn.functional.softmax(logits.logits, dim=-1)
predictions = torch.argmax(probabilities, axis=1)
prediction_labels = [label_dict[int(el)] for el in predictions]


predictions

test_dataset['predictions'] = prediction_labels
test_dataset['correct'] = test_dataset['patent_type'] == test_dataset['predictions']
test_dataset

corrects = sum(test_dataset['correct'])
total = len(test_dataset)
fraction = corrects/total
print(f"{corrects} corrects out of {total}, which makes an accuracy of: {fraction}")

I have manually checked, and model_inputs and tokenized_test_dataset['input_ids'] appear to be equal, so the error doesn’t seem to be at this point…

ArnauC · December 20, 2021, 10:09am

Ok, I seem to have identified the source of the error:
When tokenizing the data for the predictions, if I use the argument return_tensors="pt" and then predict with logits = model(**model_inputs) instead of logits = model(model_inputs), I obtain results that are equal to the ones obtained with the trainer. To clarify:

Old code (leads to incorrect behaviour):

encoded_test_dataset = tokenizer(test_dataset_list, padding="max_length", truncation = True)
model_inputs = torch.tensor(encoded_test_dataset['input_ids'])
logits = model(model_inputs)

New code (leads to correct behaviour):

model_inputs = tokenizer(test_dataset_list, padding="max_length", truncation = True, return_tensors="pt")
logits = model(**model_inputs)

Could you please shed some light on why the behaviour differs?
Thanks a lot

sgugger · December 20, 2021, 1:50pm

You are not passing the attention mask when doing logits = model(model_inputs) in the first sample. If you look at the model_inputs in the second sample, you will see it has several keys: input IDs and attention mask in particular. This second tensor tells your model which tokens to ignore because of padding and, as you observed, leaving it out leads to different results.

This is also explained in depth in the course.
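To make the effect of the mask concrete, here is a toy, framework-free sketch. It is not how BERT is implemented: attention_pool is a hypothetical function that reduces attention to a single softmax-weighted average, and all scores and values are made-up numbers. It only illustrates why padded positions change the result unless they are masked out.

```python
import math


def attention_pool(scores, values, mask=None):
    """Softmax-weighted average of `values`, optionally ignoring masked positions.

    scores: one raw attention score per token
    values: one number per token (a stand-in for a hidden state)
    mask:   1 for real tokens, 0 for padding; None means "attend to everything"
    """
    if mask is None:
        mask = [1] * len(scores)
    # Masked positions get weight 0, the same effect as adding -inf before softmax.
    weights = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Two real tokens followed by two padding tokens (all numbers are made up).
scores = [2.0, 1.0, 0.5, 0.5]
values = [1.0, 3.0, -4.0, -4.0]
mask = [1, 1, 0, 0]

unpadded = attention_pool(scores[:2], values[:2])
with_mask = attention_pool(scores, values, mask)
without_mask = attention_pool(scores, values)

print(unpadded, with_mask, without_mask)
```

With the mask, the padded sequence gives the same result as the unpadded one. Without it, the padding positions receive non-zero weight and shift the output, which is the kind of discrepancy seen when only input_ids are passed to the model.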

emmakelo · December 20, 2021, 6:21pm

Hello, I tried your solution but I still get a difference between the predictions after training and the predictions at inference time on the same test file. I do not know what to do.
