Hello,
I have a trained multi-label text classifier that I want to use for inference, so I loaded it with `pipeline("text-classification", "./model")`. When calculating performance in the downstream application, I noticed that the metrics were noticeably worse than those I obtained during training (both calculated on the same held-out test set).
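For reference, this is roughly how I load and run it (variable names are simplified, and I pass `top_k=None` so I get a score for every label):

```python
from transformers import pipeline

# Load the fine-tuned multi-label classifier from the local checkpoint.
# top_k=None makes the pipeline return a score for every label,
# not just the highest-scoring one.
pipe = pipeline("text-classification", model="./model", top_k=None)

# test_texts is the held-out test set (placeholder name).
predictions = pipe(test_texts)
```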
Here is a histogram of the difference in outputs between the two models:
Only about a quarter of the measurements in the bar at 0 are exactly equal to 0.
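In case it matters, this is roughly how I computed those differences (assuming a sigmoid over the `trainer.predict` logits on the trainer side, and aligning the pipeline scores via `label2id`):

```python
import numpy as np
import torch

# Trainer-side probabilities: multi-label, so sigmoid over the raw logits.
logits = trainer.predict(test_dataset).predictions
trainer_probs = torch.sigmoid(torch.tensor(logits)).numpy()

# Pipeline-side probabilities, reordered to match the label order above
# (the pipeline sorts its output by score, not by label index).
label2id = pipe.model.config.label2id
pipe_probs = np.array(
    [
        [d["score"] for d in sorted(out, key=lambda d: label2id[d["label"]])]
        for out in pipe(test_texts)
    ]
)

# Per-label differences that went into the histogram.
diff = (trainer_probs - pipe_probs).flatten()
```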
I have been scratching my head over this for the whole day, and any help would be greatly appreciated.
Edit: I tested running the pipeline on the GPU and got the same results. I also checked the parameters of the models in both the trainer and the pipeline, and they are identical.
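For completeness, this is more or less how I compared the parameters (a sketch, using the same `trainer` and `pipe` objects as above):

```python
import torch

# Compare every weight tensor between the trainer's model and the pipeline's.
trainer_state = trainer.model.state_dict()
pipe_state = pipe.model.state_dict()

assert trainer_state.keys() == pipe_state.keys()
for name in trainer_state:
    # allclose rather than strict equality, in case of dtype/device round-trips
    if not torch.allclose(trainer_state[name].cpu().float(),
                          pipe_state[name].cpu().float()):
        print(f"mismatch in {name}")
```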