How to represent paginated documents as a single instance of training data for whole document classification?

Hi @plamb.

I have the same objective as you: classifying multi-page document images (for example, PDF documents whose pages can be converted to images) using both the layout and the text at the same time.

@nielsr of Hugging Face works on Document Image Classification (see his GitHub), but I did not find a notebook or script in this work that classifies a document from all of its pages.

  • DiT (paper):
    • Performing inference with DiT for document image classification
  • LayoutLM (paper):
    • Fine-tuning LayoutLMForSequenceClassification on the RVL-CDIP dataset
  • LayoutLMv2 (paper):
    • Fine-tuning LayoutLMv2ForSequenceClassification on RVL-CDIP

Regarding LayoutLMv2 for single-page document classification, there is also a publication by Karndeep Singh that looks similar to the one by @nielsr.

I also searched arxiv.org for whole-document classification and found the paper “Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning”, which was updated in January 2022 (see image below). Unfortunately, I did not find any associated code or notebook.

Back to LayoutLMv2 (or LayoutLMv3 now): how do you think we could use it for multi-page document classification? @nielsr, have you already worked on this subject? Thanks.
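
To make the question concrete: the simplest baseline I can think of is to run a single-page classifier on every page and pool the per-page scores into one document-level prediction. Below is a minimal sketch of that idea only. It assumes the per-page logits have already been produced by some single-page model (for example LayoutLMv2ForSequenceClassification); the helper name `aggregate_page_logits` and the choice of mean pooling are my own assumptions, not something taken from the works listed above.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the logits of one page."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_page_logits(page_logits: Sequence[Sequence[float]]) -> int:
    """Hypothetical helper: average the per-page class probabilities and
    return the index of the most likely class for the whole document."""
    if not page_logits:
        raise ValueError("document has no pages")
    num_classes = len(page_logits[0])
    mean_probs = [0.0] * num_classes
    for logits in page_logits:
        if len(logits) != num_classes:
            raise ValueError("all pages must have the same number of classes")
        for i, p in enumerate(softmax(logits)):
            mean_probs[i] += p / len(page_logits)
    # Index of the class with the highest averaged probability
    return max(range(num_classes), key=mean_probs.__getitem__)


# Made-up logits for a 3-page document and 2 classes (illustration only)
doc = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
predicted_class = aggregate_page_logits(doc)
```

This kind of pooling ignores page order and any context shared across pages, which is exactly the limitation I would like to get around.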