How to represent paginated documents as a single instance of training data for whole document classification?

The Hugging Face Transformers library includes a number of document processing models that can do whole document classification. At least one of these models (LayoutLMv2) requires three inputs for each instance of training data:

  1. a resized image of the document,
  2. the words in the document, and
  3. the word bounding boxes.

(I suspect several other models require the same inputs.) The HF documentation provides a number of examples that support this use case, but I can’t find any that discuss paginated documents. Bounding boxes, for example, are based on the dimensions of a given page, so the paginated nature of the document needs to pass through HF Datasets, into torch, and into training (e.g. you can’t just concatenate all the paginated data). In essence, you need an HF Datasets representation and a torch representation that encode the paginated nature of the document and carry a single label (if you’re doing classification). This was my naive attempt at supporting paginated documents in HF Datasets:

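        from datasets import Array2D, Array3D, Array4D, ClassLabel, Features

        # The leading None is the variable number of pages per document.
        # unique_labels is the list of class names, defined elsewhere.
        features = Features({
            'image': Array4D(dtype="uint8", shape=(None, 3, 224, 224)),
            'input_ids': Array2D(dtype='int64', shape=(None, 512)),
            'attention_mask': Array2D(dtype='int64', shape=(None, 512)),
            'token_type_ids': Array2D(dtype='int64', shape=(None, 512)),
            'bbox': Array3D(dtype="int64", shape=(None, 512, 4)),
            'labels': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
        })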

Here, every training data instance is represented as a set of arrays whose first dimension (the None) is the number of pages, and the instance is given a single label (the Processor uses the key labels, so multi-label classification is supported). This data is loaded and passed into:

dataloader = torch.utils.data.DataLoader(..., batch_size=None)

This basically exploits the batch dimension to represent the pages. It seems to work until torch reaches the labels portion of the instance, where no batch dimension is provided because the label is supposed to represent the entire batch:

ValueError: Expected input batch_size (57) to match target batch_size (1).
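
For illustration, the mismatch can be reproduced with plain PyTorch, independent of LayoutLMv2. This is a minimal sketch, assuming 57 pages, 16 classes, and random logits standing in for the model's per-page outputs (all hypothetical). Averaging the page logits is only one possible way to get a single document-level prediction that lines up with the single label:

```python
import torch
import torch.nn.functional as F

NUM_PAGES, NUM_CLASSES = 57, 16  # hypothetical sizes for illustration

# Stand-in for per-page classifier outputs: one row of logits per page.
page_logits = torch.randn(NUM_PAGES, NUM_CLASSES)
# A single label for the whole document.
label = torch.tensor([3])


def naive_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Treats pages as the batch, so 57 predictions meet 1 target and it fails."""
    return F.cross_entropy(logits, target)


def pooled_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Averages page logits into one document-level row before the loss."""
    doc_logits = logits.mean(dim=0, keepdim=True)  # shape (1, num_classes)
    return F.cross_entropy(doc_logits, target)


try:
    naive_loss(page_logits, label)
except (ValueError, RuntimeError) as err:
    print("naive:", err)

print("pooled loss:", pooled_loss(page_logits, label).item())
```

Pooling before the loss keeps the "pages as batch" trick for the model inputs while giving the loss exactly one prediction per document.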

Anyways, I wanted to pass this idea and the greater question to the HF community: how do you represent paginated documents to HF datasets/transformers models?

Hi @plamb.

I have the same objective as you: classification of multi-page image documents (for example, PDF documents whose pages can be converted to images) by using - at the same time - both the layout and text.

@nielsr of Hugging Face works on Document Image Classification (see his GitHub), but I did not find in this work a notebook/script that classifies a document using all of its pages.

  • DiT (paper):
    • performing inference with DiT for document image classification
  • LayoutLM (paper):
    • fine-tuning LayoutLMForSequenceClassification on the RVL-CDIP dataset
  • LayoutLMv2 (paper):
    • fine-tuning LayoutLMv2ForSequenceClassification on RVL-CDIP

On using LayoutLMv2 for single-page document classification, there is also a publication by Karndeep Singh that looks similar to the one by @nielsr.

I also searched for work on whole document classification and found the paper “Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning”, which was updated in January 2022 (see image below). Unfortunately, I did not find any associated code/notebook.

Back to LayoutLMv2 (or LayoutLMv3 now): how do you think we could use it for multi-page document classification? @nielsr, have you already worked on this subject? Thanks.

Thanks for this reply, Pierre. I’m going to be digging back into this soon and will update the thread with anything I find.

Just a thought: would fine-tuning a Siamese network with LayoutLMv2 as the core model help? (Siamese Neural Network.ipynb)

Quick explanation: each page image of the PDF document would be processed by a LayoutLMv2 model, and the output embeddings would be combined through a linear layer (fully connected layer) in order to get one probability per category. The weights of this linear layer would be learned during classification fine-tuning, at the same time as the LayoutLMv2 weights are updated by backpropagation of the loss.
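
A rough sketch of that idea in PyTorch follows. It makes several assumptions: `PageEncoder` is a hypothetical stand-in for LayoutLMv2 (a real setup would use the model's pooled output per page), the page features are random, and the page embeddings are mean-pooled before the linear layer so that documents with different page counts fit the same head. None of this is a tested recipe for LayoutLMv2 itself:

```python
import torch
from torch import nn


class PageEncoder(nn.Module):
    """Hypothetical stand-in for a page-level model such as LayoutLMv2.

    It maps pre-computed page features to one embedding per page; a real
    setup would return the model's pooled output for each page instead.
    """

    def __init__(self, feature_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.Tanh())

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        return self.proj(pages)  # (num_pages, hidden_dim)


class MultiPageClassifier(nn.Module):
    """Shared page encoder, mean pooling over pages, then a linear head."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        page_embeddings = self.encoder(pages)         # same weights for every page
        doc_embedding = page_embeddings.mean(dim=0)   # (hidden_dim,)
        return self.head(doc_embedding).unsqueeze(0)  # (1, num_classes)


torch.manual_seed(0)
model = MultiPageClassifier(PageEncoder(32, 64), hidden_dim=64, num_classes=5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

pages = torch.randn(7, 32)  # one document with 7 pages (random features)
label = torch.tensor([2])   # one label for the whole document

logits = model(pages)
loss = nn.functional.cross_entropy(logits, label)
loss.backward()             # gradients reach both the head and the encoder
optimizer.step()
```

Because the loss is computed once per document, a single backward pass updates the head and the shared encoder together, which is the joint training described above.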

Spent a few hours mining the SOTA in this space. Your link, [2009.14457] Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning, seems to be the best suited for what we’re describing, but like you said, there is no available model. I’ve pinged the authors to see if anything is available.

Two other options that might work:

Doc classification using just images (as you noted): DiT

and long text classification using skim-attention: GitHub - recitalAI/skim-attention: Supporting code for "Skim-Attention: Learning to Focus via Document Layout", EMNLP 2021

For my use case, the 512 sequence limit in LayoutLMv2 is a big problem.
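
One generic workaround for a fixed sequence limit is to split a long page into overlapping windows and classify each window. The sketch below is plain Python and only an illustration: `split_into_windows` is a hypothetical helper, it assumes the items are already tokens paired with their boxes, and it ignores the special tokens a real tokenizer adds:

```python
from typing import Iterator, List, Tuple

Box = Tuple[int, int, int, int]


def split_into_windows(
    tokens: List[str], boxes: List[Box], max_len: int = 512, stride: int = 128
) -> Iterator[Tuple[List[str], List[Box]]]:
    """Yield overlapping (tokens, boxes) windows of at most max_len items.

    Consecutive windows share `stride` items so context is not cut abruptly.
    """
    if len(tokens) != len(boxes):
        raise ValueError("tokens and boxes must have the same length")
    if not 0 <= stride < max_len:
        raise ValueError("stride must be in [0, max_len)")
    step = max_len - stride
    start = 0
    while True:
        end = start + max_len
        yield tokens[start:end], boxes[start:end]
        if end >= len(tokens):
            break
        start += step


# Made-up page with 1200 tokens, each with a dummy box.
tokens = [f"w{i}" for i in range(1200)]
boxes = [(0, 0, 1, 1)] * 1200
windows = list(split_into_windows(tokens, boxes))
print([len(window_tokens) for window_tokens, _ in windows])
```

Each window keeps its tokens aligned with their boxes, so the per-window predictions can later be combined the same way as per-page predictions.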

In terms of method, using DiT for document image classification is similar to using LayoutLMv2: you classify pages, not multi-page documents. We need to use these models in a larger pipeline that can weight the logits predicted for each page.
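
Such a pipeline step could look roughly like the following. It is a sketch under stated assumptions: the logits are made up, and weighting pages by their token count is just one hypothetical choice (uniform weights or learned weights would fit the same function):

```python
import torch


def aggregate_page_logits(
    page_logits: torch.Tensor, page_weights: torch.Tensor
) -> torch.Tensor:
    """Combine per-page logits (num_pages, num_classes) into document probabilities."""
    weights = page_weights / page_weights.sum()  # normalize so weights sum to 1
    doc_logits = (weights.unsqueeze(1) * page_logits).sum(dim=0)
    return torch.softmax(doc_logits, dim=-1)     # (num_classes,)


# Three pages, four classes; the numbers are made up.
page_logits = torch.tensor([
    [2.0, 0.1, 0.0, -1.0],
    [1.5, 0.3, 0.2, -0.5],
    [0.0, 0.0, 0.1, 0.0],  # nearly empty page with uninformative logits
])
# Hypothetical weights: number of tokens found on each page.
token_counts = torch.tensor([400.0, 350.0, 20.0])

doc_probs = aggregate_page_logits(page_logits, token_counts)
predicted_class = int(doc_probs.argmax())
print(predicted_class, doc_probs)
```

With these weights the nearly empty third page contributes little, so the document-level prediction is driven by the two text-heavy pages.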

It looks interesting, as we can use Longformer: does this mean Skimformer can evaluate many pages at once?

Hi Pierre, I heard back from the authors of [2009.14457] Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning who said this:

“This work is still not published and hence we have not released the code yet.”

So it looks like they plan to publish a model once it is complete.

Skimformer appears to be able to evaluate multiple pages at once. Unfortunately, they don’t have a pretrained model up for download (GitHub - recitalAI/skim-attention: Supporting code for "Skim-Attention: Learning to Focus via Document Layout", EMNLP 2021), but I’ve reached out to them as well to find out whether one is available somewhere.

At the moment, if you want to fine-tune a pre-trained model, it seems like text-only classification is available via Longformer, but there isn’t much else.