Multi-input classification (images + text)

Hi folks,

I am pretty new here, so here is my problem: I am trying to classify sequences of document images. Each document can be between 1 and 10 pages long. I noticed that some models, such as LayoutLM, are designed explicitly for document images; however, it seems they can only take one image at a time. In our setting we need multiple images per document, since two different documents could contain similar images somewhere in them. I can also use OCR to convert the images to text with the corresponding coordinates.

I come from the TensorFlow world. In the past, I have used its functional API to train a single model that takes multiple inputs, but I am not sure how to do that with HuggingFace. Does anyone have experience with this type of problem?

Thank you!

Uhm, HuggingFace Transformers is simply a collection of models written in PyTorch, TensorFlow, and Flax. So yeah, you just take the source code of the model you want and plug it into whichever model you want.
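
To make that a bit more concrete, here is a minimal PyTorch sketch of one way to handle a variable number of pages: run every page through a shared encoder, average over the real pages, and classify the pooled vector. Everything here is an assumption for illustration: `MultiPageClassifier`, `page_mask`, and the toy `nn.Sequential` encoder are hypothetical names, and the toy encoder just stands in for whatever pretrained model produces one vector per page.

```python
import torch
import torch.nn as nn


class MultiPageClassifier(nn.Module):
    """Encode each page with a shared encoder, mean-pool over pages, classify."""

    def __init__(self, page_encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.page_encoder = page_encoder  # maps (N, feature_dim) -> (N, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pages: torch.Tensor, page_mask: torch.Tensor) -> torch.Tensor:
        # pages: (batch, max_pages, feature_dim)
        # page_mask: (batch, max_pages), 1 for a real page, 0 for padding
        batch, max_pages, feature_dim = pages.shape
        flat = pages.reshape(batch * max_pages, feature_dim)
        encoded = self.page_encoder(flat).reshape(batch, max_pages, -1)
        mask = page_mask.unsqueeze(-1).to(encoded.dtype)
        # Masked mean, so padded pages do not influence the document vector.
        pooled = (encoded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)


# Hypothetical stand-in for a pretrained per-page encoder.
toy_encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
model = MultiPageClassifier(toy_encoder, hidden_size=64, num_labels=5)

pages = torch.randn(2, 10, 32)   # 2 documents, padded to 10 pages each
page_mask = torch.zeros(2, 10)
page_mask[0, :3] = 1             # first document has 3 pages
page_mask[1, :] = 1              # second document has 10 pages

logits = model(pages, page_mask)
print(logits.shape)  # torch.Size([2, 5])
```

Mean pooling is just the simplest choice; a small recurrent or attention layer over the page vectors would be a drop-in replacement if page order matters.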

Thank you! Is it possible to load the pre-trained weights that way?

Yeah, but you might have to watch out for the keys in the state dict (assuming you use PyTorch): they have to match the parameter names in your own model.
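
A rough sketch of what I mean by minding the keys: if your custom model stores the pretrained part under a different attribute name, the key prefixes in the checkpoint will not match and need renaming first. The helper `remap_prefix` and the prefixes `encoder.` / `page_encoder.` below are made up for illustration, and the string values stand in for real weight tensors so the snippet runs without any framework installed.

```python
from collections import OrderedDict


def remap_prefix(state_dict, old_prefix, new_prefix):
    """Return a copy of state_dict with old_prefix replaced by new_prefix on matching keys."""
    remapped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        remapped[key] = value
    return remapped


# Hypothetical checkpoint; in practice the values would be weight tensors.
pretrained = OrderedDict([
    ("encoder.layer.0.weight", "w0"),
    ("encoder.layer.0.bias", "b0"),
    ("classifier.weight", "w_head"),
])

remapped = remap_prefix(pretrained, "encoder.", "page_encoder.")
print(list(remapped))
# ['page_encoder.layer.0.weight', 'page_encoder.layer.0.bias', 'classifier.weight']
```

With a real PyTorch model you would then call `model.load_state_dict(remapped, strict=False)` and check the returned `missing_keys` and `unexpected_keys` to see what did not line up.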

Actually, I use TensorFlow. Do you have any tutorials I could read? Sorry, I am completely new to HuggingFace. I appreciate your help and patience.

Sorry, I am not familiar enough with TensorFlow. Try reading more of their docs.