Trainer output much better than output from loaded model

I fine-tuned a model using Hugging Face Transformers, Datasets, and the Trainer. However, I'm now running into a curious issue: the output from the QA Trainer is excellent, but loading the trained model into a pipeline gives me terrible results.
During training I got a ROUGE-2 F1 of around 0.89. I checkpointed the model, loaded it from the best checkpoint, and saved the model files. But when I now run the loaded model with the same tokenizer in a pipeline, performance is terrible.

I used the QA Trainer script from HF's repo with only very minor modifications. Any thoughts?
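For reference, this is roughly how I'm reloading the checkpoint into a pipeline. The checkpoint path and the helper names here are placeholders, not my actual setup; the key point is that the tokenizer and model are both loaded from the same saved checkpoint directory, and that I normalize answers before comparing them against the Trainer's predictions so formatting differences don't get mistaken for a quality drop:

```python
# Sketch of reloading a saved QA checkpoint into a pipeline.
# The checkpoint path ("./checkpoints/best") is a hypothetical placeholder.

def load_qa_pipeline(checkpoint_dir: str):
    """Load the tokenizer and model from the same checkpoint directory,
    so the pipeline uses exactly the weights that were saved."""
    from transformers import (AutoModelForQuestionAnswering,
                              AutoTokenizer, pipeline)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint_dir)
    return pipeline("question-answering", model=model, tokenizer=tokenizer)


def normalize_answer(text: str) -> str:
    """Lowercase and collapse whitespace before comparing a pipeline
    answer with a Trainer prediction."""
    return " ".join(text.lower().split())


if __name__ == "__main__":
    qa = load_qa_pipeline("./checkpoints/best")  # hypothetical path
    out = qa(
        question="Who wrote Hamlet?",
        context="Hamlet was written by William Shakespeare.",
    )
    print(normalize_answer(out["answer"]))
```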