Title: Recommendations for Models that Handle Text and Screenshots for QA

Hi Hugging Face Community,

I’m looking for models that can process documents containing both text and related screenshots of software, along with a prompt, to create a question-answering (QA) system.

I think “image-text-to-text” models should be able to do this, but they often seem to focus primarily on images.

What models are best suited for this task?

Thanks!

I guess these models are the most famous ones.

Dear @John6666

Many thanks for your quick reply.

I tried these two models:
The Qwen model only accepts images as input.
The Llama model gives an error on my docx file.

I wonder if there is any model that accepts a docx file containing both text and images and responds to a given prompt about it.

Thanks anyway
Regards

I’ve seen PDFs from time to time, but very few models or Spaces take Microsoft Word files as input.
Proprietary-format files are usually converted first and then passed to the VLM or LLM. If you let the generative AI do the conversion itself, it wastes compute unnecessarily…
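For example, here is a minimal sketch of that “convert first, then pass to the model” idea for a docx: pull out the text with python-docx and the embedded pictures straight from the zip archive. The file name is just an illustration, and note that this naive approach loses the position of each image relative to the surrounding text, which is exactly the issue discussed further down.

```python
# Sketch: extract text and embedded images from a .docx before passing them to a model.
# Assumes python-docx is installed (pip install python-docx); "manual.docx" is a placeholder.
import zipfile
from docx import Document

path = "manual.docx"  # hypothetical input file

# Plain text: iterate over the paragraphs of the document.
doc = Document(path)
text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())

# Embedded images: a .docx is a zip archive; pictures live under word/media/.
with zipfile.ZipFile(path) as z:
    image_names = [n for n in z.namelist() if n.startswith("word/media/")]
    images = {n: z.read(n) for n in image_names}

print(text[:500])
print("Found images:", list(images))
```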

Edit:
I found it.

Dear @John6666

Many thanks for your attention and help :slight_smile:

I tried the first option (PDF Chatbot). It seems interesting.

The question is whether this app and similar ones consider the relationship between each image and its corresponding text paragraph, or whether they just extract the text paragraphs and the OCR results of the images separately and combine everything, without preserving the ordering and the relation between them.

https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html
Good question. That Space seems to be using PyPDFLoader from the LangChain library internally, so it is handling the PDF’s text layer rather than doing OCR on the images.
As for why it’s a good question: I also found the loader below, which I think gets you most of the way there.
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html
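For reference, here is a quick sketch of the two loaders mentioned above, side by side. The file names are placeholders, and the extra packages (pypdf, docx2txt) are assumptions about your environment.

```python
# Requires: pip install langchain-community pypdf docx2txt
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader

# PyPDFLoader reads the embedded text layer of the PDF, one Document per page;
# it does not OCR images inside the PDF.
pdf_docs = PyPDFLoader("report.pdf").load()

# Docx2txtLoader extracts the plain text of a Word file into a single Document.
docx_docs = Docx2txtLoader("report.docx").load()

print(len(pdf_docs), pdf_docs[0].page_content[:200])
print(docx_docs[0].page_content[:200])
```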

Edit:
Tips on how to work with LangChain and HF.
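As a starting point, a minimal sketch of wiring a Hugging Face model into LangChain, assuming the langchain-huggingface integration package is installed; the model id is only an example.

```python
# pip install langchain-huggingface transformers
from langchain_huggingface import HuggingFacePipeline

# Wrap a local transformers pipeline as a LangChain LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # example checkpoint, swap for your own
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

print(llm.invoke("Summarize: LangChain loaders turn files into Document objects."))
```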

Thanks @John6666 for your attention.

So I guess there is no available tool or library yet that does this.

It may not exist in the form of a ready-made tool yet.
LangChain is a well-known library, though not an HF one, and it seems to have Word file processing as well as PDF.
I wonder if it would be possible to change the PDF processing part of the PDF chatbot to Word file processing, as sketched below.
Of course, this is only the general framework of what can be achieved; the details would have to be adjusted.
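A rough sketch of that swap, choosing the loader by file extension so the same chatbot code can ingest either PDFs or Word files. The splitting step and the downstream embedding/chat parts are placeholders and would need to match whatever the actual Space does.

```python
# pip install langchain-community langchain-text-splitters pypdf docx2txt
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_document(path: str):
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return PyPDFLoader(path).load()
    if suffix == ".docx":
        return Docx2txtLoader(path).load()
    raise ValueError(f"Unsupported file type: {suffix}")

docs = load_document("handbook.docx")  # hypothetical file
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
# ...then embed `chunks`, store them in a vector store, and answer questions as the PDF chatbot does.
```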

Thanks @John6666. Yes, LangChain is a great framework, and I tried to use it for my task, but again I could not find any option to merge text and image processing.

I see. In this use case, it would have to be something image-compatible feeding a VLM…
I can’t say I have seen one.
It might be faster to convert the file to PDF once and then to images.
Unlike the old .doc format, a .docx file has a simple structure, basically XML and resource files zipped together, but you would need to know Word’s own layout rules to render it properly. It would be very difficult to do that yourself. A rough sketch of the PDF-then-images route is below.
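A minimal sketch of the docx → PDF → images route. It assumes LibreOffice is installed for the conversion (on some systems the binary is `soffice` rather than `libreoffice`) and that poppler is available for pdf2image; the file names are placeholders.

```python
# pip install pdf2image  (plus poppler on the system)
import subprocess
from pdf2image import convert_from_path

docx_path = "manual.docx"  # hypothetical input
out_dir = "."

# 1) Convert the Word file to PDF with LibreOffice in headless mode.
subprocess.run(
    ["libreoffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, docx_path],
    check=True,
)

# 2) Render each PDF page as a PIL image, ready to feed to a VLM or an OCR step.
pages = convert_from_path("manual.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"page_{i}.png")
```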

Converting to PDF and then to images is the one option I had not tried yet. Thanks for your suggestion @John6666

Thanks @John6666 for your last suggestion. It worked (except for some issues with the OCR of the images inside the PDF).

That’s good! :grinning: If you follow the posts below about PDF layout issues, there may be some progress in the future.
Word files are usually business documents, so they don’t have as many unusual layouts as academic papers, but I think the basic issues are the same. And there is always demand for improvements to OCR.

Good resource :ok_hand:

For a QA system with both text and screenshots, try models like BLIP or OFA, which handle both modalities well. You can also look into vision-language models like CLIP paired with a text model on Hugging Face. A great starting point.
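For instance, a quick sketch of visual question answering over one of the rendered page images, using the transformers pipeline with a public BLIP VQA checkpoint; the image path and question are placeholders.

```python
# pip install transformers pillow
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="page_0.png", question="Which menu is highlighted in the screenshot?")
print(answer)
```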
