How to represent paginated documents as a single training data instance

The model I’m working with in particular is the LayoutLMv2 model.

I’m attempting to fine-tune this model on a set of documents, each of which has a document-classification label. Many of these documents have multiple pages. I am also attempting to use my own OCR results instead of the built-in OCR. The documentation provides this code example for such a case:

```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", revision="no_ocr"
)

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
)
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```
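On the "normalize your bounding boxes" comment: LayoutLMv2 expects coordinates scaled to a 0–1000 range relative to the page size. A minimal sketch (the helper name and the example page dimensions are my own, not from the docs):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range LayoutLMv2 expects."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# OCR boxes in pixels for a hypothetical 1224x1584 page image
raw_boxes = [(61, 79, 183, 110), (190, 79, 290, 110)]
boxes = [normalize_bbox(b, 1224, 1584) for b in raw_boxes]
```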

So you provide an image, a list of strings (one per word), and a List[List[int]] with one bounding box per word. But what if there are multiple pages? Now we have an image per page, a set of strings per page, and a set of bounding boxes per page. My instinct was to provide a List[Image] for the page images, a List[List[str]] where each inner List[str] is one page of text, and a List[List[List[int]]] for the bounding boxes. However, the processor does not have a case for this (it interprets a List[List[str]] as a batch). So the question remains: how does one represent paginated instances?


I noticed that if I process my paginated document with apply_ocr=True, the tokenizer is called once per page in the document. So what I need to do is iterate over the images and call processor(...) for each image + words + boxes. However, how should the returned encodings be concatenated to represent a single document instance in the training set?
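One possibility, shown below with toy data, is to concatenate the per-page encodings along the sequence dimension. This is a sketch of an idea, not a confirmed answer: the helper is my own, the "encodings" here are plain dicts of lists standing in for the processor's real output, and the merged sequence keeps each page's [CLS]/[SEP] tokens and may exceed the model's 512-token limit, which would still require truncation or chunking.

```python
def merge_page_encodings(page_encodings):
    """Concatenate per-page encodings (dicts of per-token lists) into one
    document-level dict by extending each field in page order.

    Hypothetical helper: whether LayoutLMv2 can consume the result as a
    single training instance is exactly the open question in this thread.
    """
    merged = {key: [] for key in page_encodings[0]}
    for enc in page_encodings:
        for key, values in enc.items():
            merged[key].extend(values)
    return merged

# Toy stand-ins for what processor(image, words, boxes=boxes) returns per page.
page1 = {"input_ids": [101, 7592, 102], "bbox": [[0] * 4, [1, 2, 3, 4], [0] * 4]}
page2 = {"input_ids": [101, 2088, 102], "bbox": [[0] * 4, [5, 6, 7, 8], [0] * 4]}
doc = merge_page_encodings([page1, page2])
```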

For anyone who comes across this, I wanted to link this bug report I wrote, which might save them some time.


I’m still not 100% sure whether I am generating model inputs correctly from paginated data, but I will update the thread when I find out.

Note: after doing some research, I posted a better-formulated question here: How to represent paginated documents as a single instance of training data for whole document classification?