Differences in predictions from end of training vs. loaded checkpoint

When I finish training my model I call trainer.predict() and save the results; I use load_best_model_at_end=True so the best model is loaded at the end. When I later loaded the same model from its checkpoint and ran trainer.predict() again, I got a difference in the predictions starting from about the 3rd decimal place (e.g. 4.65234375 vs. 4.6540775299 | 4.671875 vs. 4.6732983589).
I checked, and the weights of the last classification layer look the same. I am using the tokenizer loaded from the checkpoint and the same tokenization function.
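To rule out weight differences, a check along these lines could compare every tensor, not just the classification head (just a sketch, not my exact code; it assumes the trained model is still in memory as model and the best checkpoint was saved to best_model_dir):

import torch
from transformers import AutoModelForSequenceClassification

# Reload the saved checkpoint and compare each weight tensor against the
# model still in memory after training.
reloaded = AutoModelForSequenceClassification.from_pretrained(best_model_dir)
reloaded_sd = reloaded.state_dict()

for name, tensor in model.state_dict().items():
    if not torch.allclose(tensor.cpu(), reloaded_sd[name].cpu()):
        print(f"mismatch in {name}")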
Does that make sense? What could have caused it?

Post the relevant code please; to me it sounds like you're overlooking something.

These are the main blocks.
I use a Longformer model, with model_max_length set to 900.
The first time, I used the code below to train the model from a pre-trained checkpoint and saved the eval results at the end of training. The second time, I loaded the model from the "best_model_dir" checkpoint, skipped training, and just ran the Predict + Save eval results part (a rough sketch of that second run follows the code below).
The results differ between the first and second run starting from about the 3rd decimal place.
Hope you can help.

import datasets
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# model_checkpoint, model_max_length, num_labels, device, output_dir, etc.
# are defined elsewhere in my script.
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_checkpoint,
    use_fast=True,
    model_max_length=model_max_length,
)
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_checkpoint,
    num_labels=num_labels,
    problem_type="single_label_classification",
).to(device)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenizer_function(examples, tokenizer, text_column, max_length=512):
    return tokenizer(examples[text_column], truncation=True, max_length=max_length)

dataset = datasets.DatasetDict()
dataset["train"] = Dataset.from_pandas(df_train)
dataset["val"] = Dataset.from_pandas(df_val)

encoded_dataset = dataset.map(
    lambda x: tokenizer_function(x, tokenizer, TEXT_COL, max_length=model_max_length), batched=True, 
)

training_args = {
    "evaluation_strategy": "epoch",
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 24,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "save_total_limit": 2,
    "warmup_ratio": 0.01,
    "metric_for_best_model": "f1",
    "save_strategy": "epoch",
    "optim": "adamw_torch",
    "fp16": True,
    "load_best_model_at_end": True,
    "remove_unused_columns": True,
    "logging_first_step": True
}

args = TrainingArguments(output_dir=output_dir, **training_args)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.save_model(best_model_dir)

# Predict + Save eval results
y_pred = trainer.predict(encoded_dataset["val"])  # .predictions are the raw logits
y_probs = logits_to_probs(y_pred.predictions)
df_val["prediction"] = list(y_pred.predictions)
df_val["probability"] = list(y_probs)
df_val.to_json(eval_results_path)
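For completeness, the second run is roughly this (a sketch, reusing the same args, preprocessing and compute_metrics as above, just without trainer.train()):

# Second run (sketch): load the saved best model and only predict
model = AutoModelForSequenceClassification.from_pretrained(best_model_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained(best_model_dir, use_fast=True, model_max_length=model_max_length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    eval_dataset=encoded_dataset["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Predict + Save eval results, same as above but without training
y_pred = trainer.predict(encoded_dataset["val"])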

I don't know that model or model type yet, but keep the following in mind:

Prediction and evaluation are two completely different processes. That's one reason why, for the final test of your trained model, you use a test run rather than evaluation: eval is meant for monitoring and validating progress during training.

If you want to compare the values, use the same function to produce them: at the end, call trainer.evaluate() rather than trainer.predict().

trainer.evaluate(trainer.eval_dataset)
# this ensures your final metric is gathered and calculated exactly the same way as during training

The evaluate method is basically the same code that runs when evaluating during training, and it relies on the TrainingArguments you set plus the defaults for anything you did not set. For example, in my training of a different model there are parameters that influence the resources used for evaluation, which also affects the results and the resulting metrics.
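For illustration only (parameter names from TrainingArguments; which ones matter depends on your setup), evaluation and prediction are driven by fields that are easy to leave at their defaults:

from transformers import TrainingArguments

# Illustration: TrainingArguments fields that affect how evaluation/prediction
# is executed and fall back to fixed defaults when not set explicitly.
args = TrainingArguments(
    output_dir="tmp",
    per_device_eval_batch_size=8,  # default is 8 if you don't set it
    eval_accumulation_steps=None,  # default: accumulate all predictions on the device
    fp16_full_eval=False,          # default: do not run eval fully in half precision
)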

If you want to get the same values from predict and evaluate on your dataset, you have to ensure that all parameters involved in predict and evaluate, especially unset ones that fall back to defaults, match 100%, and that the way you then compute the final metric is also exactly the same.
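A quick way to compare the two paths directly (a sketch, using the trainer and compute_metrics from the question) is to run both on the same dataset and print the metrics each one reports:

# Sketch: get the metric from both code paths and compare.
eval_metrics = trainer.evaluate(trainer.eval_dataset)  # metric keys prefixed with "eval_"
pred_output = trainer.predict(trainer.eval_dataset)    # metrics prefixed with "test_"

print(eval_metrics)
print(pred_output.metrics)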