Getting unexpected results from a fine-tuned BERT model

Hi, I am new to the NLP domain and to Hugging Face in particular. I recently learned how to fine-tune models, so I decided to try it out. I fine-tuned the bert-base-cased model on a news sentiment dataset. Everything went well and I got a validation accuracy of around 85%. I also checked the confusion matrix and the model is performing well there.
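
For context, my fine-tuning setup roughly looks like the sketch below. The dataset files, column names, and hyperparameters here are placeholders for what I actually used:

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Placeholder: my real data is a news sentiment CSV with "text" and "label"
# columns (0 = negative, 1 = positive)
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # Tokenize the raw news text for BERT
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Report plain accuracy on the validation split
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

args = TrainingArguments(
    output_dir="bert-news-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()
```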

However, when I tried to test it on some outside data (I passed in a few news examples of my own), the model gives unexpected results. For clearly negative news it shows a high positive score, and for positive news it shows a high negative score. I initially thought the target encoding might simply be flipped, but the results turn out to be inconsistent rather than reversed: sometimes the positive news is correctly identified, but most of the time the predictions look essentially random.
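
This is roughly how I am running the model on the outside examples. It is a simplified sketch; the saved model path, the example headlines, and the assumption that index 0 = negative and index 1 = positive are just my guesses at what should be correct:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "bert-news-sentiment"  # directory where the fine-tuned model was saved
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

examples = [
    "Company X shares plunge 20% after a fraud investigation is announced.",  # clearly negative
    "Company Y reports record profits and raises its full-year guidance.",    # clearly positive
]

inputs = tokenizer(examples, truncation=True, padding=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

# Assuming index 0 = negative and index 1 = positive; if the training labels
# were encoded the other way round, these scores would look flipped
for text, p in zip(examples, probs):
    print(f"{text}\n  negative={p[0]:.3f}  positive={p[1]:.3f}")
```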

Could someone please explain what the possible reasons behind this are, and what steps I should take to analyse the problem?