I am pretty new here, so here is my problem description: I am trying to classify sequences of document images. Each document can be between 1 and 10 pages long. I noticed that some models, such as LayoutLM, are designed explicitly for document images; however, it seems they can only take one image at a time. In our setting, the model needs to see all pages of a document together, since two different documents could contain similar-looking pages somewhere in them, so a single page is not always enough to tell them apart. I can also use OCR to convert the images to text with the corresponding coordinates.
I come from the TensorFlow world. In the past, I have used its functional API to train a single model that takes multiple inputs, but I am not sure how to do that with HuggingFace. Does anyone have experience with this type of problem?
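To make the question more concrete, here is a minimal plain-PyTorch sketch of the kind of multi-input model I have in mind. It is only an illustration under my own assumptions, not something taken from the HuggingFace docs: `MultiPageClassifier`, `page_dim`, `hidden_dim`, and `page_mask` are names I made up, each page is represented by a precomputed feature vector, and the small linear `page_encoder` is a hypothetical stand-in for a real page-level backbone such as LayoutLM. Documents are padded to 10 pages and a mask marks which pages are real.

```python
import torch
from torch import nn


class MultiPageClassifier(nn.Module):
    """Classifies a document from a padded stack of per-page feature vectors."""

    def __init__(self, page_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        # Hypothetical stand-in for a real page encoder (e.g. a LayoutLM backbone).
        self.page_encoder = nn.Sequential(nn.Linear(page_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, pages: torch.Tensor, page_mask: torch.Tensor) -> torch.Tensor:
        # pages: (batch, max_pages, page_dim)
        # page_mask: (batch, max_pages), 1 for a real page, 0 for padding
        batch, max_pages, page_dim = pages.shape
        # Encode every page independently by folding pages into the batch axis.
        encoded = self.page_encoder(pages.reshape(batch * max_pages, page_dim))
        encoded = encoded.reshape(batch, max_pages, -1)
        mask = page_mask.unsqueeze(-1).to(encoded.dtype)
        # Masked mean over pages, so padding pages do not affect the document vector.
        doc_vec = (encoded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(doc_vec)


torch.manual_seed(0)
model = MultiPageClassifier(page_dim=32, hidden_dim=64, num_labels=4)

# Two documents padded to 10 pages: the first has 3 real pages, the second has 10.
pages = torch.randn(2, 10, 32)
page_mask = torch.zeros(2, 10)
page_mask[0, :3] = 1
page_mask[1, :10] = 1

logits = model(pages, page_mask)
print(logits.shape)
```

Mean pooling is just the simplest way I could think of to combine the pages; it ignores page order, so a recurrent or attention layer over the page vectors might be a better fit. What I do not know is how to plug a HuggingFace model in as the page encoder here.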