Anyone else run into trouble tagging named entities in scanned newspaper archives? OCR works okay, but once you throw in inconsistent layouts or multilingual headlines, NER models start falling apart. I’ve seen place names pulled from ad sections and people’s names cut off partway through because of bad line breaks or weird fonts.
We tried using a fine-tuned transformer, but the results were all over the place. What helped a bit was adding a layout-aware filter step to our pipeline, ahead of the tagger. We’re using this setup in Collatio Digital Archive, where the system first separates editorial zones from ads and metadata, then runs tagging only on the clean segments. That helped us avoid most of the junk entity hits and made tags more consistent across long-form pieces.
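For anyone who wants to try the same idea, here is a rough Python sketch of a filter-then-tag step. It is not our production code: the `Zone` class, the `"editorial"` / `"ad"` labels, and the function names are all made up for illustration, and it assumes some upstream layout-analysis step has already labelled each OCR zone. The tagger is passed in as a plain callable, so you can plug in whatever NER model you already use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Zone:
    """One OCR zone. `label` is assumed to come from a layout-analysis step."""
    label: str          # hypothetical labels: "editorial", "ad", "metadata"
    lines: List[str]    # raw OCR lines in reading order


# Assumption: only zones with these labels are worth tagging.
EDITORIAL_LABELS = {"editorial"}


def join_lines(lines: List[str]) -> str:
    """Rejoin OCR lines, gluing words that were hyphenated at a line break.

    Naive on purpose: it also glues genuine hyphenated compounds that happen
    to break at the hyphen, so a real pipeline would need a dictionary check.
    """
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if text.endswith("-"):
            text = text[:-1] + line      # "Har-" + "rison" -> "Harrison"
        elif text:
            text += " " + line
        else:
            text = line
    return text


def clean_segments(zones: List[Zone]) -> List[str]:
    """Keep editorial zones only and return their repaired text."""
    return [join_lines(z.lines) for z in zones if z.label in EDITORIAL_LABELS]


def tag_zones(
    zones: List[Zone],
    tagger: Callable[[str], List[Tuple[str, str]]],
) -> List[Tuple[str, str]]:
    """Run the supplied tagger on clean segments only."""
    entities: List[Tuple[str, str]] = []
    for segment in clean_segments(zones):
        entities.extend(tagger(segment))
    return entities


# Toy input: one ad zone and one editorial zone with a broken line.
zones = [
    Zone("ad", ["Visit Smith & Sons, Boston"]),
    Zone("editorial", ["Mayor John Har-", "rison spoke in Chicago."]),
]
print(clean_segments(zones))  # ['Mayor John Harrison spoke in Chicago.']
```

The point of the design is that the tagger never sees ad or metadata text at all, so "Boston" from the ad can't show up as a junk hit, and the name broken across two lines reaches the model as one token. How well this works depends almost entirely on how good the zone labels are.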
Still feels like there’s more to fix, especially with older or poorly scanned content. Has anyone here gotten decent NER results from messy historical data? Did layout context or post-OCR filtering play a role in your approach?