Slightly different output from trainer.predict and pipeline(..., function_to_apply="none")


I have a trained multi-label text classifier that I want to use for inference. I loaded it with pipeline("text-classification", "./model"). When calculating performance in the downstream application, I noticed that the metrics were noticeably worse than those I obtained during training (both calculated on the same held-out test set).

Here is a histogram of the differences in outputs between the trainer and the pipeline:
Only about a quarter of the measurements in the 0 bar are exactly equal to 0.
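
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the differences could be summarized. It assumes the raw logits from both paths have already been collected into two equally shaped nested lists (one row per example, one value per label). The function name `summarize_differences`, the `atol` tolerance, and the toy numbers are all hypothetical and not taken from my actual setup.

```python
import math


def summarize_differences(logits_a, logits_b, atol=1e-6):
    """Summarize element-wise absolute differences between two logit tables.

    logits_a / logits_b: lists of rows, one row per example, one float per label.
    atol: hypothetical tolerance below which a difference is treated as noise.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both inputs must contain the same number of examples")

    diffs = []
    for row_a, row_b in zip(logits_a, logits_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same number of labels")
        diffs.extend(abs(a - b) for a, b in zip(row_a, row_b))

    if not diffs:
        raise ValueError("no values to compare")

    n = len(diffs)
    return {
        "count": n,
        "max_abs_diff": max(diffs),
        "mean_abs_diff": math.fsum(diffs) / n,
        # Share of values that match bit-for-bit.
        "frac_exact": sum(d == 0.0 for d in diffs) / n,
        # Share of values that match up to the chosen tolerance.
        "frac_within_atol": sum(d <= atol for d in diffs) / n,
    }


if __name__ == "__main__":
    # Toy numbers for illustration only, not real model outputs.
    trainer_logits = [[1.0, 2.0], [3.0, 4.0]]
    pipeline_logits = [[1.0, 2.5], [3.0, 4.0]]
    print(summarize_differences(trainer_logits, pipeline_logits))
```

Comparing `frac_exact` with `frac_within_atol` helps separate tiny floating-point noise from differences large enough to move the metrics, and `max_abs_diff` shows how far apart the worst case is.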

I have been scratching my head over this for the whole day, and any help would be greatly appreciated :)

Edit: I tested running the pipeline on the GPU and I get the same results. I also checked the model parameters in both the trainer and the pipeline, and they are identical.