About extracting text information as well as relevant images from documents like PDF, DOC, etc.

Adarsh123 · March 4, 2024, 5:14pm

My use case is to extract the relevant text along with the images in a file using generative AI. When a prompt is given, the relevant text and images should be displayed in the response.

Kindly help by providing some ideas, links or techniques.
Thank you.

FredS · May 5, 2024, 9:56am

Search for multimodal RAG.

nielsr · May 6, 2024, 6:58am

Yes, indeed, this is a multimodal RAG use case.

Something that often works well is summarizing the images using an image-text-to-text model. This could be an open-source one like LLaVA, LLaVA-NeXT, or Idefics2, which are among the best open-source options at the time of writing, or a closed-source one like GPT-4V or Gemini.

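Then you can embed the image summaries just like regular text, store the embeddings in a vector database, and perform regular RAG. Keep a reference to the original image next to each summary so that the image itself can be shown when its summary is retrieved.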
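
To make the flow concrete, here is a minimal sketch in plain Python, not any particular library's API. Everything in it is an assumption for illustration: the `Chunk` and `VectorStore` names, the file paths, and the hard-coded image summaries (which stand in for the output of an image-text-to-text model). The `embed` function is a toy bag-of-words counter standing in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str     # "text" or "image"
    content: str  # passage text, or the model-written summary of an image
    source: str   # hypothetical page reference or image path, returned to the user


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Chunk]] = []

    def add(self, chunk: Chunk) -> None:
        # Text passages and image summaries are embedded the same way.
        self._items.append((embed(chunk.content), chunk))

    def search(self, query: str, k: int = 2) -> list[Chunk]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = VectorStore()
# Text passages extracted from the document (made-up example data).
store.add(Chunk("text", "Revenue grew 12 percent year over year, driven by cloud services.", "report.pdf#page=3"))
store.add(Chunk("text", "The company opened two new offices in Europe.", "report.pdf#page=7"))
# Image summaries; in practice these would be written by the image-text-to-text model.
store.add(Chunk("image", "Bar chart showing quarterly revenue by region for 2023.", "images/page3_fig1.png"))
store.add(Chunk("image", "Photo of the new office building in Berlin.", "images/page7_fig1.png"))

results = store.search("quarterly revenue chart", k=2)
for chunk in results:
    # For image hits, chunk.source points at the file to display with the answer.
    print(chunk.kind, chunk.source)
```

The retrieved text chunks and image summaries go into the LLM prompt as context, and the `source` of any image hit tells you which image to display alongside the generated answer.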

hugolb · July 27, 2024, 12:40pm

Check out this cookbook; it seems very easy to do multimodal RAG with this library:
https://github.com/ntropy-ai/ntropy/blob/main/examples/sp500-report/main.ipynb
