Differences in predictions from end of training vs. loaded checkpoint

When I finish training my model I call trainer.predict() and save the results; I use load_best_model_at_end=True so the best model is loaded at the end. When I later loaded the same model from its checkpoint and ran trainer.predict() again, I got a difference in the predictions starting from about the 3rd decimal place (e.g. 4.65234375 vs. 4.6540775299 | 4.671875 vs. 4.6732983589).
I checked, and the weights of the last classification layer look the same. I am using the tokenizer loaded from the checkpoint and the same tokenization function.
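To rule out weight differences, a check along these lines could compare every tensor, not just the classification head (just a sketch, not my exact code; it assumes the trained model is still in memory as model and the best checkpoint was saved to best_model_dir):

import torch
from transformers import AutoModelForSequenceClassification

# Reload the saved checkpoint and compare each weight tensor against the
# model still in memory after training.
reloaded = AutoModelForSequenceClassification.from_pretrained(best_model_dir)
reloaded_sd = reloaded.state_dict()

for name, tensor in model.state_dict().items():
    if not torch.allclose(tensor.cpu(), reloaded_sd[name].cpu()):
        print(f"mismatch in {name}")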
Does that make sense? What could have caused it?

Post the relevant code please; to me it sounds like you're overlooking something.

These are the main blocks.
I use a Longformer model, with model_max_length set to 900.
The first time, I used the code below to train the model from a pre-trained checkpoint and saved the eval results at the end of training. The second time, I loaded the model from the "best_model_dir" checkpoint, skipped training, and just ran the Predict + Save eval results part (a rough sketch of that second run follows the code below).
The results differ between the first and second run starting from about the 3rd decimal place.
Hope you can help.

import datasets
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# model_checkpoint, model_max_length, num_labels, device, output_dir, etc.
# are defined elsewhere in my script.
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_checkpoint,
    use_fast=True,
    model_max_length=model_max_length,
)
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=model_checkpoint,
    num_labels=num_labels,
    problem_type="single_label_classification",
).to(device)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenizer_function(examples, tokenizer, text_column, max_length=512):
    return tokenizer(examples[text_column], truncation=True, max_length=max_length)

dataset = datasets.DatasetDict()
dataset["train"] = Dataset.from_pandas(df_train)
dataset["val"] = Dataset.from_pandas(df_val)

encoded_dataset = dataset.map(
    lambda x: tokenizer_function(x, tokenizer, TEXT_COL, max_length=model_max_length), batched=True, 
)

training_args = {
    "evaluation_strategy": "epoch",
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 24,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "save_total_limit": 2,
    "warmup_ratio": 0.01,
    "metric_for_best_model": "f1",
    "save_strategy": "epoch",
    "optim": "adamw_torch",
    "fp16": True,
    "load_best_model_at_end": True,
    "remove_unused_columns": True,
    "logging_first_step": True
}

args = TrainingArguments(output_dir=output_dir, **training_args)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.save_model(best_model_dir)

# Predict + Save eval results
y_pred = trainer.predict(encoded_dataset["val"])  # .predictions are the raw logits
y_probs = logits_to_probs(y_pred.predictions)
df_val["prediction"] = list(y_pred.predictions)
df_val["probability"] = list(y_probs)
df_val.to_json(eval_results_path)
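For completeness, the second run is roughly this (a sketch, reusing the same args, preprocessing and compute_metrics as above, just without trainer.train()):

# Second run (sketch): load the saved best model and only predict
model = AutoModelForSequenceClassification.from_pretrained(best_model_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained(best_model_dir, use_fast=True, model_max_length=model_max_length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    eval_dataset=encoded_dataset["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Predict + Save eval results, same as above but without training
y_pred = trainer.predict(encoded_dataset["val"])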

I don't know that model or model type yet, but keep the following in mind:

Prediction and evaluation are two completely different processes. That's one reason why, for the final test of your trained model, you use a test run rather than evaluation: eval is meant for monitoring and validating progress during training.

If you want to compare the values, use the same function to produce them: at the end, call trainer.evaluate() rather than trainer.predict().

trainer.evaluate(trainer.eval_dataset)
# this ensures your final metric is gathered and calculated exactly the same way as during training

The evaluate method is basically the same code that runs when evaluating during training, and it relies on the TrainingArguments you set plus the defaults for anything you did not set. For example, in my training of a different model there are parameters that influence the resources used for evaluation, which also affects the results and the resulting metrics.
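For illustration only (parameter names from TrainingArguments; which ones matter depends on your setup), evaluation and prediction are driven by fields that are easy to leave at their defaults:

from transformers import TrainingArguments

# Illustration: TrainingArguments fields that affect how evaluation/prediction
# is executed and fall back to fixed defaults when not set explicitly.
args = TrainingArguments(
    output_dir="tmp",
    per_device_eval_batch_size=8,  # default is 8 if you don't set it
    eval_accumulation_steps=None,  # default: accumulate all predictions on the device
    fp16_full_eval=False,          # default: do not run eval fully in half precision
)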

If you want to get the same values from predict and evaluate on your dataset, you have to ensure that all parameters involved in predict and evaluate, especially unset ones that fall back to defaults, match 100%, and that the way you then compute the final metric is also exactly the same.
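A quick way to compare the two paths directly (a sketch, using the trainer and compute_metrics from the question) is to run both on the same dataset and print the metrics each one reports:

# Sketch: get the metric from both code paths and compare.
eval_metrics = trainer.evaluate(trainer.eval_dataset)  # metric keys prefixed with "eval_"
pred_output = trainer.predict(trainer.eval_dataset)    # metrics prefixed with "test_"

print(eval_metrics)
print(pred_output.metrics)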