Extracting text and relevant images from documents such as PDF, DOC, etc.

The use case is to extract the relevant text along with the images available in a file using generative AI. When a prompt is given, the relevant text and images should be displayed in the response.

Kindly help by providing some ideas, links or techniques.
Thank you.

Search for MultiModal RAG

Yes, indeed, this is a multimodal RAG use case.

Something that often works well is summarizing the images using an image-text-to-text model. This could be an open-source one such as LLaVA, LLaVA-NeXT, or Idefics2 (among the best open options at the time of writing), or a closed-source one such as GPT-4V or Gemini.

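Then you can embed the image summaries just like regular text, store the embeddings in a vector database, and perform regular RAG. If you keep a reference to the original image next to each summary, the image itself can be shown whenever its summary is retrieved.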
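
To make the flow concrete, here is a minimal sketch of the idea in plain Python. Everything in it is an assumption for illustration: the `Chunk` and `Index` classes, the file name, and the sample texts are hypothetical. The image summary is hard-coded where a vision-language model would normally produce it, the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    text: str                         # a text passage, or the summary of an image
    image_path: Optional[str] = None  # set only when the chunk describes an image


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Index:
    # In-memory stand-in for a vector database.
    def __init__(self) -> None:
        self.items: List[Tuple[Counter, Chunk]] = []

    def add(self, chunk: Chunk) -> None:
        self.items.append((embed(chunk.text), chunk))

    def search(self, query: str, k: int = 2) -> List[Chunk]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


index = Index()
# Ordinary text extracted from the document.
index.add(Chunk("The pump must be serviced every six months."))
index.add(Chunk("Warranty terms cover parts for two years."))
# Image summary (hard-coded here; normally generated by a vision-language model),
# stored together with a pointer back to the original image file.
index.add(Chunk(
    "Diagram of the pump showing the inlet valve and the outlet valve.",
    image_path="manual_page3_img1.png",
))

hits = index.search("Where is the inlet valve on the pump?", k=2)
for hit in hits:
    kind = "image " + hit.image_path if hit.image_path else "text"
    print(kind, "|", hit.text)
```

The retrieved chunks would then go into the prompt of the generating model, and any chunk with an `image_path` tells the application which image to display next to the answer.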