The Hugging Face Transformers library includes a number of document processing models that can do whole-document classification. At least one of these models (LayoutLMv2) requires 3 inputs for each instance of training data:
- a resized image of the document,
- the words in the document
- and the word bounding boxes
(I suspect a number of them require these inputs.) The HF documentation provides a number of examples that support this use case, but I can't find any that discuss paginated documents. Bounding boxes, for example, are relative to the dimensions of a given page, so the paginated nature of the document needs to pass through HF Datasets, into torch, and into training (e.g., you can't just concatenate all the pages' data together). In essence, you need an HF Datasets representation and a torch representation that encode the paginated nature of the document and carry a single label (if you're doing classification). This was my naive idea for supporting paginated documents in HF Datasets:
from datasets import Features, ClassLabel, Array2D, Array3D, Array4D

features = Features({
    'image': Array4D(dtype="uint8", shape=(None, 3, 224, 224)),
    'input_ids': Array2D(dtype='int64', shape=(None, 512)),
    'attention_mask': Array2D(dtype='int64', shape=(None, 512)),
    'token_type_ids': Array2D(dtype='int64', shape=(None, 512)),
    'bbox': Array3D(dtype="int64", shape=(None, 512, 4)),
    'labels': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
})
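For context, this is roughly how I imagine an instance with that schema gets built: encode each page independently, then stack the per-page arrays along a new leading "pages" axis. In this sketch, encode_page is a hypothetical stand-in for running LayoutLMv2Processor on one page (it just fabricates arrays of the right shapes), so only the stacking logic is real:

```python
import numpy as np

# Hypothetical stand-in for LayoutLMv2Processor applied to a single page: in
# the real pipeline these arrays would come from the processor's output.
def encode_page(rng):
    return {
        "image": rng.integers(0, 256, size=(3, 224, 224), dtype=np.uint8),
        "input_ids": rng.integers(0, 30522, size=(512,), dtype=np.int64),
        "attention_mask": np.ones(512, dtype=np.int64),
        "token_type_ids": np.zeros(512, dtype=np.int64),
        "bbox": rng.integers(0, 1001, size=(512, 4), dtype=np.int64),
    }

def encode_document(num_pages, label_id, seed=0):
    # Stack per-page arrays along a new leading "pages" dimension (the None in
    # the Features schema) and attach a single document-level label.
    rng = np.random.default_rng(seed)
    pages = [encode_page(rng) for _ in range(num_pages)]
    doc = {key: np.stack([p[key] for p in pages]) for key in pages[0]}
    doc["labels"] = label_id
    return doc

doc = encode_document(num_pages=5, label_id=2)
print(doc["image"].shape)  # (5, 3, 224, 224)
print(doc["bbox"].shape)   # (5, 512, 4)
```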
Here, every training data instance is represented as a stack of per-page arrays: the first dimension (the None) is the number of pages, and the instance is given a single labels value (the Processor uses the key labels, so multi-label classification is supported). This data is loaded and passed into:
dataloader = torch.utils.data.DataLoader(encoded_data, batch_size=None)
This basically exploits batch_size as the representation of the pages: with batch_size=None, the DataLoader disables automatic batching and yields each document's page stack as though it were a batch. This seems to work until torch encounters the labels portion of the instance, where the batch sizes don't match because the single label is supposed to represent the entirety of the "batch":
ValueError: Expected input batch_size (57) to match target batch_size (1).
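To make the failure mode concrete, here's a minimal sketch with toy tensors (the Linear head is a hypothetical stand-in for the real classification head): per-page logits have a leading dimension of 57 while the document label has a leading dimension of 1, so the loss rejects them. The pooling at the end is just one possible workaround I can imagine, not a canonical fix:

```python
import torch
import torch.nn.functional as F

# Toy reproduction of the shape mismatch (shapes only; the Linear layer is a
# hypothetical stand-in for the model's classification head).
num_pages, hidden, num_classes = 57, 16, 4
page_features = torch.randn(num_pages, hidden)
logits = torch.nn.Linear(hidden, num_classes)(page_features)  # (57, 4)
labels = torch.tensor([2])  # a single label for the whole "batch" of pages

# cross_entropy expects one target per row of logits, hence the
# "Expected input batch_size (57) to match target batch_size (1)" error.
try:
    F.cross_entropy(logits, labels)
except Exception as err:
    print(type(err).__name__)

# One possible workaround: pool the per-page logits into a single
# document-level prediction before computing the loss.
doc_logits = logits.mean(dim=0, keepdim=True)  # (1, num_classes)
loss = F.cross_entropy(doc_logits, labels)
print(loss.dim())  # 0: a valid scalar loss
```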
Anyway, I wanted to pass this idea and the broader question along to the HF community: how do you represent paginated documents to HF Datasets/Transformers models?