The particular model I’m working with is LayoutLMv2.
I’m attempting to fine-tune this model on a set of documents, each of which has a document classification label. Many of these documents have multiple pages. I am also attempting to use my own OCR results instead of the processor's built-in OCR. The documentation provides this code example for such a case:
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```
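(For context, as far as I understand the docs, "normalize" here means scaling each box to a 0–1000 coordinate grid. The helper below is just my own sketch of that step, not from the documentation; `normalize_box` is a name I made up.)

```python
def normalize_box(box, page_width, page_height):
    # Scale pixel coordinates [x0, y0, x1, y1] to the 0-1000 grid LayoutLMv2 expects
    return [
        int(1000 * box[0] / page_width),
        int(1000 * box[1] / page_height),
        int(1000 * box[2] / page_width),
        int(1000 * box[3] / page_height),
    ]

# e.g. boxes = [normalize_box(b, image.width, image.height) for b in my_ocr_boxes]
```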
So you provide an image, a list of strings representing each word, and a List[List[int]] giving a bounding box per word. But what if there are multiple pages? Now we have an image per page, a list of words per page, and a set of bounding boxes per page. My instinct was to provide a List[Image] for the page images, a List[List[str]] where each inner List[str] is one page of text, and a List[List[List[int]]] for the bounding boxes. However, the processor does not have a case for this (it interprets a List[List[str]] as a batch). So the question remains: how does one represent paginated instances?
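To make the shapes concrete, here is a small sketch of what I tried (dummy images and words; the processor accepts this shape, but treats it as a batch of two separate single-page documents rather than one two-page document):

```python
from PIL import Image
from transformers import LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", revision="no_ocr"
)

# Dummy stand-ins for a single two-page document
page_images = [Image.new("RGB", (612, 792), "white") for _ in range(2)]  # List[Image]
page_words = [["hello", "world"], ["page", "two"]]                       # List[List[str]]
page_boxes = [[[10, 10, 100, 40], [110, 10, 200, 40]],
              [[10, 10, 100, 40], [110, 10, 200, 40]]]                   # List[List[List[int]]]

encoding = processor(page_images, page_words, boxes=page_boxes,
                     padding=True, return_tensors="pt")
print(encoding["input_ids"].shape)  # first dimension is 2: one row per "document", not per page
```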
I noticed that if I process my paginated document with apply_ocr=True, the tokenizer is called once per page of the document. So what I need to do is iterate over the page images and call processor(...) for each image + words + boxes. However, how should the returned encodings be concatenated together to represent a single document instance in the training set?
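For reference, the loop I have in mind looks roughly like this (a sketch; page_images, page_words and page_boxes are my own per-page OCR output, and the concatenation at the end is only a guess, which is exactly the part I'm unsure about):

```python
import torch

page_encodings = []
for image, words, boxes in zip(page_images, page_words, page_boxes):
    # One processor call per page
    enc = processor(image, words, boxes=boxes, return_tensors="pt")
    page_encodings.append(enc)

# Naive guess: concatenate the token-level tensors along the sequence dimension...
input_ids = torch.cat([enc["input_ids"] for enc in page_encodings], dim=1)
bbox = torch.cat([enc["bbox"] for enc in page_encodings], dim=1)
# ...but each encoding also carries its own "image" tensor, and it's not clear how
# those (or the model's 512-token limit) should be handled for one multi-page instance.
```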