The particular model I’m working with is LayoutLMv2.
I’m attempting to fine-tune this model on a set of documents, each of which has a document classification label. Many of these documents have multiple pages. I am also attempting to use my own OCR results instead of the processor's built-in OCR. The documentation provides this code example for such a case:
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open(
    "name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)."
).convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```
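(For context, as far as I understand the docs, "normalize" here means scaling each box to a 0–1000 coordinate grid. The helper below is just my own sketch of that step, not from the documentation; `normalize_box` is a name I made up.)

```python
def normalize_box(box, page_width, page_height):
    # Scale pixel coordinates [x0, y0, x1, y1] to the 0-1000 grid LayoutLMv2 expects
    return [
        int(1000 * box[0] / page_width),
        int(1000 * box[1] / page_height),
        int(1000 * box[2] / page_width),
        int(1000 * box[3] / page_height),
    ]

# e.g. boxes = [normalize_box(b, image.width, image.height) for b in my_ocr_boxes]
```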
So you provide an image, a list of strings representing each word, and a List[List[int]] giving a bounding box per word. But what if there are multiple pages? Now we have an image per page, a list of words per page, and a set of bounding boxes per page. My instinct was to provide a List[Image] for the page images, a List[List[str]] where each inner List[str] is one page of text, and a List[List[List[int]]] for the bounding boxes. However, the processor does not have a case for this (it interprets a List[List[str]] as a batch). So the question remains: how does one represent paginated instances?
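To make the shapes concrete, here is a small sketch of what I tried (dummy images and words; the processor accepts this shape, but treats it as a batch of two separate single-page documents rather than one two-page document):

```python
from PIL import Image
from transformers import LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", revision="no_ocr"
)

# Dummy stand-ins for a single two-page document
page_images = [Image.new("RGB", (612, 792), "white") for _ in range(2)]  # List[Image]
page_words = [["hello", "world"], ["page", "two"]]                       # List[List[str]]
page_boxes = [[[10, 10, 100, 40], [110, 10, 200, 40]],
              [[10, 10, 100, 40], [110, 10, 200, 40]]]                   # List[List[List[int]]]

encoding = processor(page_images, page_words, boxes=page_boxes,
                     padding=True, return_tensors="pt")
print(encoding["input_ids"].shape)  # first dimension is 2: one row per "document", not per page
```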
I noticed that if I process my paginated document with apply_ocr=True, the tokenizer is called once per page of the document. So what I need to do is iterate over the page images and call processor(...) for each image + words + boxes. However, how should the returned encodings be concatenated together to represent a single document instance in the training set?
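For reference, the loop I have in mind looks roughly like this (a sketch; page_images, page_words and page_boxes are my own per-page OCR output, and the concatenation at the end is only a guess, which is exactly the part I'm unsure about):

```python
import torch

page_encodings = []
for image, words, boxes in zip(page_images, page_words, page_boxes):
    # One processor call per page
    enc = processor(image, words, boxes=boxes, return_tensors="pt")
    page_encodings.append(enc)

# Naive guess: concatenate the token-level tensors along the sequence dimension...
input_ids = torch.cat([enc["input_ids"] for enc in page_encodings], dim=1)
bbox = torch.cat([enc["bbox"] for enc in page_encodings], dim=1)
# ...but each encoding also carries its own "image" tensor, and it's not clear how
# those (or the model's 512-token limit) should be handled for one multi-page instance.
```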