Fine-tuning LayoutLMv2 for token classification on the CORD dataset

I used this colab:

to fine-tune LayoutLMv2ForTokenClassification on the CORD dataset.

Here is the result:

  • F1: 0.9665

The results are indeed pretty impressive when running on the test set.
However, when running on any other receipt (printed or PDF), the results are completely off.
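For reference, this is roughly how I run the fine-tuned model on an out-of-dataset receipt (a minimal sketch; the checkpoint name and image path are placeholders, and I rely on the processor's built-in OCR):

```python
from PIL import Image
import torch
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

# placeholder path for my CORD-fine-tuned checkpoint
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained("my-cord-checkpoint")
model.eval()

# an arbitrary receipt image that is NOT from CORD
image = Image.open("receipt.jpg").convert("RGB")

# the processor runs its built-in Tesseract OCR to get words + bounding boxes
encoding = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**encoding)

# map the highest-scoring class index of each token to its label name
predictions = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
print(labels)
```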

So for some reason the model seems to be overfitting to the CORD dataset, even though I test on similar images.
I don't think there is data leakage, unless the CORD dataset itself is not clean (which I assume it is).

What could be the reason for this?
Is it some inherent property of LayoutLM?
The LayoutLM models are somewhat old, and they seem to be no longer maintained…

I don't have much experience, so I would appreciate any info.
Thanks