Title: Recommendations for Models that Handle Text and Screenshots for QA

Hi Hugging Face Community,

I’m looking for models that can process documents containing both text and related screenshots of software, along with a prompt, to create a question-answering (QA) system.

I think “image-text-to-text” models should be able to do this, but they often seem to focus primarily on images.

What models are best suited for this task?

Thanks!

I guess these models are the most famous ones.

Dear @John6666

Many thanks for your quick reply.

I tried these two models:
The Qwen model only accepts images as input.
The Llama model gives an error on my docx file.

I wonder if there is any model that accepts a docx file containing both text and images and responds to a given prompt about it.

Thanks anyway
Regards

I’ve seen PDFs from time to time, but very few models or Spaces take Microsoft Word files as input.
Proprietary-format files are usually converted first and then passed to the VLM or LLM. If you let the generative AI do the conversion itself, it wastes compute unnecessarily…
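For example, here is a minimal sketch of that “convert first, then pass to the model” idea for a docx: pull out the text with python-docx and the embedded pictures straight from the zip archive. The file name is just an illustration, and note that this naive approach loses the position of each image relative to the surrounding text, which is exactly the issue discussed further down.

```python
# Sketch: extract text and embedded images from a .docx before passing them to a model.
# Assumes python-docx is installed (pip install python-docx); "manual.docx" is a placeholder.
import zipfile
from docx import Document

path = "manual.docx"  # hypothetical input file

# Plain text: iterate over the paragraphs of the document.
doc = Document(path)
text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())

# Embedded images: a .docx is a zip archive; pictures live under word/media/.
with zipfile.ZipFile(path) as z:
    image_names = [n for n in z.namelist() if n.startswith("word/media/")]
    images = {n: z.read(n) for n in image_names}

print(text[:500])
print("Found images:", list(images))
```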

Edit:
I found it.

Dear @John6666

Many thanks for your attention and help :slight_smile:

I tried the first option (PDF Chatbot). It seems interesting.

The question is whether this app and similar ones consider the relationship between each image and its corresponding text paragraph, or whether they just extract the text paragraphs and the OCR results of the images separately and combine everything, without preserving the ordering and the relation between them.

https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html
Good question. That Space seems to be using PyPDFLoader from the LangChain library internally, so it is handling the PDF’s text layer rather than doing OCR on the images.
As for why it’s a good question: I also found the loader below, which I think gets you most of the way there.
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html
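For reference, here is a quick sketch of the two loaders mentioned above, side by side. The file names are placeholders, and the extra packages (pypdf, docx2txt) are assumptions about your environment.

```python
# Requires: pip install langchain-community pypdf docx2txt
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader

# PyPDFLoader reads the embedded text layer of the PDF, one Document per page;
# it does not OCR images inside the PDF.
pdf_docs = PyPDFLoader("report.pdf").load()

# Docx2txtLoader extracts the plain text of a Word file into a single Document.
docx_docs = Docx2txtLoader("report.docx").load()

print(len(pdf_docs), pdf_docs[0].page_content[:200])
print(docx_docs[0].page_content[:200])
```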

Edit:
Tips on how to work with LangChain and HF.
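As a starting point, a minimal sketch of wiring a Hugging Face model into LangChain, assuming the langchain-huggingface integration package is installed; the model id is only an example.

```python
# pip install langchain-huggingface transformers
from langchain_huggingface import HuggingFacePipeline

# Wrap a local transformers pipeline as a LangChain LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint, swap for your own
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

print(llm.invoke("Summarize: LangChain loaders turn files into Document objects."))
```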

Thanks @John6666 for your attention.

So I guess there is no available tool or library yet that does this.

It may not exist in the form of a ready-made tool yet.
LangChain is a well-known library, though not an HF one, and it seems to have Word file processing as well as PDF.
I wonder if it would be possible to change the PDF processing part of the PDF chatbot to Word file processing, as sketched below.
Of course, this is only the general framework of what can be achieved; the details would have to be adjusted.
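A rough sketch of that swap, choosing the loader by file extension so the same chatbot code can ingest either PDFs or Word files. The splitting step and the downstream embedding/chat parts are placeholders and would need to match whatever the actual Space does.

```python
# pip install langchain-community langchain-text-splitters pypdf docx2txt
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_document(path: str):
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return PyPDFLoader(path).load()
    if suffix == ".docx":
        return Docx2txtLoader(path).load()
    raise ValueError(f"Unsupported file type: {suffix}")

docs = load_document("handbook.docx")  # hypothetical file
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
# ...then embed `chunks`, store them in a vector store, and answer questions as the PDF chatbot does.
```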

Thanks @John6666. Yes, LangChain is a great framework, and I tried to use it for my task, but again I could not find any option to merge text and image processing.

I see. In this use case, it would have to be something image-compatible feeding a VLM…
I can’t say I have seen one.
It might be faster to convert the file to PDF once and then to images.
Unlike the old .doc format, a .docx file has a simple structure, basically XML and resource files zipped together, but you would need to know Word’s own layout rules to render it properly. It would be very difficult to do that yourself. A rough sketch of the PDF-then-images route is below.
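A minimal sketch of the docx → PDF → images route. It assumes LibreOffice is installed for the conversion (on some systems the binary is `soffice` rather than `libreoffice`) and that poppler is available for pdf2image; the file names are placeholders.

```python
# pip install pdf2image  (plus poppler on the system)
import subprocess
from pdf2image import convert_from_path

docx_path = "manual.docx"  # hypothetical input
out_dir = "."

# 1) Convert the Word file to PDF with LibreOffice in headless mode.
subprocess.run(
    ["libreoffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, docx_path],
    check=True,
)

# 2) Render each PDF page as a PIL image, ready to feed to a VLM or an OCR step.
pages = convert_from_path("manual.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"page_{i}.png")
```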

Converting to PDF and then to images is the one option I had not tried yet. Thanks for your suggestion @John6666

Thanks @John6666 for your last suggestion. It worked (except for some issues with the OCR of the images inside the PDF).

That’s good! :grinning: If you follow the posts below about PDF layout issues, there may be some progress in the future.
Word files are usually business documents, so they don’t have as many unusual layouts as academic papers, but I think the basic issues are the same. And there is always demand for improvements to OCR.

Good resource :ok_hand:

For a QA system with both text and screenshots, try models like BLIP or OFA, which handle both modalities well. You can also look into vision-language models like CLIP paired with a text model on Hugging Face. A great starting point.
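For instance, a quick sketch of visual question answering over one of the rendered page images, using the transformers pipeline with a public BLIP VQA checkpoint; the image path and question are placeholders.

```python
# pip install transformers pillow
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="page_0.png", question="Which menu is highlighted in the screenshot?")
print(answer)
```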
