Multi-input classification (images + texts)

uniflow · September 2, 2022, 6:15am

Hi folks,

I am pretty new here. So here is my problem description: I am trying to classify a sequence of document images. These documents can be between 1 and 10 pages long. I noticed that some models, such as LayoutLM, are designed explicitly for document images; however, they seem to take only one image at a time. In our setting, we need multiple images per document, since two different documents could each contain similar images somewhere in them. I can also use OCR to convert the images to text with corresponding coordinates.

I come from the TensorFlow world. In the past, I have used its functional API to train a single model that takes multiple inputs. But I am not sure how to do that with Hugging Face. Does anyone have experience with this type of problem?

Thank you!

dinhanhx · September 2, 2022, 8:45am

Uhm, Hugging Face Transformers is simply a collection of models written in PyTorch, TensorFlow, and Flax. So yeah, you just take the source code of the model you want, then plug in whichever model you want.
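
One possible reading of this for the multi-page case: run each page through a single-page model separately, then combine the per-page vectors into one document vector before the classification head. The snippet below is only a minimal sketch of that combining step in plain Python, using mean pooling as one assumed choice. The function name `mean_pool_pages` and the example vectors are hypothetical and stand in for real per-page model outputs.

```python
from typing import List


def mean_pool_pages(page_vectors: List[List[float]]) -> List[float]:
    """Average per-page embedding vectors into a single document vector."""
    if not page_vectors:
        raise ValueError("a document needs at least one page")
    dim = len(page_vectors[0])
    if any(len(vec) != dim for vec in page_vectors):
        raise ValueError("all page vectors must have the same length")
    n_pages = len(page_vectors)
    # Average each embedding dimension across the pages.
    return [sum(vec[i] for vec in page_vectors) / n_pages for i in range(dim)]


# Hypothetical 3-page document with 4-dimensional page embeddings.
doc = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
doc_vector = mean_pool_pages(doc)
```

Because the pooled vector has the same size whether the document has 1 page or 10, a single classification head can sit on top of it. Mean pooling ignores page order; if order matters, a sequence model over the page vectors would be another option.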

uniflow · September 2, 2022, 6:42pm

Thank you! Is it possible to load the pre-trained weights that way?

dinhanhx · September 3, 2022, 9:47am

Yeah, but you might have to mind the keys in the state dict (assuming you use PyTorch).
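
To illustrate what minding the keys can look like: a state dict maps parameter names to tensors, and the names in a pre-trained checkpoint have to match the names in the new model. Here is a rough sketch using plain dicts and lists in place of tensors. The helper names and all parameter names (`encoder.`, `page_encoder.`, `doc_head.weight`) are made up for illustration. In PyTorch, the filtered dict would then go to `model.load_state_dict(..., strict=False)`, which reports missing and unexpected keys in a similar way.

```python
def remap_keys(state_dict, old_prefix, new_prefix):
    """Return a copy where keys starting with old_prefix use new_prefix instead."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        remapped[key] = value
    return remapped


def split_by_target(state_dict, target_keys):
    """Separate loadable entries from missing and unexpected key names."""
    matched = {k: v for k, v in state_dict.items() if k in target_keys}
    missing = sorted(set(target_keys) - set(matched))
    unexpected = sorted(set(state_dict) - set(target_keys))
    return matched, missing, unexpected


# Hypothetical checkpoint; lists stand in for weight tensors.
pretrained = {
    "encoder.layer.0.weight": [0.1],
    "encoder.layer.0.bias": [0.0],
    "classifier.weight": [0.5],
}
# Hypothetical parameter names of the new multi-page model.
target_keys = {
    "page_encoder.layer.0.weight",
    "page_encoder.layer.0.bias",
    "doc_head.weight",
}

remapped = remap_keys(pretrained, "encoder.", "page_encoder.")
matched, missing, unexpected = split_by_target(remapped, target_keys)
```

In this example the encoder weights carry over under their new names, the new `doc_head.weight` stays at its fresh initialization because the checkpoint has nothing for it, and the old `classifier.weight` is dropped.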

uniflow · September 4, 2022, 6:25am

Actually, I use TensorFlow. Do you have any tutorials I can read? Sorry, I am completely new to Hugging Face. I appreciate your help and patience.

dinhanhx · September 4, 2022, 7:15am

Sorry, I am not familiar enough with TensorFlow. Try reading their docs more.

thefaheem · February 18, 2024, 4:20pm

Hey @uniflow, have you found a solution yet?
