Effect of varying context length in QA system

I am building a QA-based system for entity linking (I implemented the EntQA paper). To improve the model, I am trying to merge two datasets: LCQuAD and T-REx. Training only on T-REx (paragraphs with entities annotated in them) worked quite well, but the T-REx annotations are incomplete: there are spans of words that should have been marked as entities but are not. To tackle that, I merged in LCQuAD, which consists of questions only, with accurate entity annotations in the text.
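
One thing worth noting is that the two datasets have very different input lengths after merging: T-REx contributes full paragraphs while LCQuAD contributes single questions. A rough check of token counts with the model's tokenizer (the sample strings below are illustrative, not actual dataset entries):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/minilm-uncased-squad2")

# Illustrative examples: T-REx inputs are full paragraphs,
# LCQuAD inputs are single questions.
trex_paragraph = (
    "Barack Obama served as the 44th president of the United States "
    "from 2009 to 2017. He was born in Honolulu, Hawaii, and served "
    "in the Illinois State Senate before his presidency."
)
lcquad_question = "Who is the president of the United States?"

for name, text in [("T-REx", trex_paragraph), ("LCQuAD", lcquad_question)]:
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens")
```

So the merged training set is heavily skewed toward short inputs compared to training on T-REx alone.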

I am fine-tuning the deepset/minilm-uncased-squad2 model. When training on the merged dataset, performance decreases for longer texts and the model stops detecting any entities in the text at all, whereas for shorter texts the performance is also not very good.
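
For reference, my preprocessing follows the standard Hugging Face SQuAD-style setup; here is a minimal sketch (the query string, hyperparameters, and sample context are illustrative, not my exact values):

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Hypothetical EntQA-style query and a long T-REx-style passage.
query = "Which spans mention entities?"
context = (
    "Barack Obama served as the 44th president of the United States "
    "from 2009 to 2017. He was born in Honolulu, Hawaii."
)

# Standard SQuAD-style windowing: long contexts are split into
# overlapping chunks of max_length tokens with a stride, while
# short LCQuAD-style questions fit entirely in one window.
encoded = tokenizer(
    query,
    context,
    max_length=384,
    stride=128,
    truncation="only_second",
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
)
# outputs.start_logits / outputs.end_logits score candidate spans per window.
```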

I am unable to understand why this is happening.