How to Pass Image-Based Math/Geometry Problems to an LLM Without a Vision Model? (OCR Not Sufficient)

I'm working on a project to build a system that solves math and geometry problems provided as images. The problems will be solved by the LLM DeepSeek R1. However, DeepSeek R1 has no vision capability, and I don't have a separate vision model either, so I need to figure out how to pass these image-based questions to the LLM.

I've considered OCR (Optical Character Recognition) systems, but they don't work well for my case: OCR can extract the printed problem text, but it cannot turn graphical elements (like diagrams or geometric shapes) into a usable text representation. The websites of large language models like DeepSeek often offer an option to upload images. How do those systems work under the hood? If anyone can provide guidance or suggestions, I would greatly appreciate it!
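From what I've read, those upload features generally run a vision model (a captioner or a full vision-language model) in front of the text-only LLM, so the LLM itself only ever receives text. Below is a minimal sketch of that kind of pipeline as I imagine it, assuming the Hugging Face `transformers` package for a small open-source captioning model (BLIP) and the OpenAI-compatible DeepSeek API; the file name, API key, and the hard-coded OCR string are placeholders, not working values.

```python
# Sketch: vision front end -> text -> text-only LLM (DeepSeek R1).
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
from openai import OpenAI

def describe_image(path: str) -> str:
    """Use a small open-source captioning model as a stand-in vision front end."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(out[0], skip_special_tokens=True)

def solve_with_deepseek(description: str, ocr_text: str) -> str:
    """Send the textual reconstruction of the problem to the text-only LLM."""
    client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
    prompt = (
        "A geometry problem was given as an image.\n"
        f"Diagram description: {description}\n"
        f"Text extracted by OCR: {ocr_text}\n"
        "Solve the problem step by step."
    )
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # DeepSeek R1
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Placeholder OCR output; in practice this would come from an OCR pass.
print(solve_with_deepseek(describe_image("problem.png"), "Find angle ABC ..."))
```

A generic captioner like this will miss geometric detail (angle labels, point names), which is presumably why the hosted sites use much stronger vision-language models for this step.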


I don't know the details, but I think you could look into something called multimodal RAG (retrieval-augmented generation): you index your diagrams as image embeddings and retrieve matching textual descriptions to pass to the text-only LLM.
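Here is a minimal sketch of what I mean, assuming the `sentence-transformers` package for a CLIP model that embeds images and text into the same space; the corpus of (image, description) pairs is hypothetical and would have to be written by hand once:

```python
# Sketch: multimodal RAG — retrieve hand-written descriptions of
# visually similar diagrams and feed them to the text-only LLM.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # joint image/text embedding space

# Hypothetical corpus: diagrams you have already described by hand.
corpus = [
    ("triangle.png", "Right triangle ABC with the right angle at B."),
    ("circle.png", "Circle with center O and an inscribed angle at P."),
]
corpus_vecs = np.stack([clip.encode(Image.open(path)) for path, _ in corpus])

def retrieve_descriptions(query_image: str, top_k: int = 1) -> list[str]:
    """Return descriptions of the most visually similar known diagrams."""
    q = clip.encode(Image.open(query_image))
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    return [corpus[i][1] for i in best]

# The retrieved text then goes into the DeepSeek R1 prompt as context.
print(retrieve_descriptions("new_problem.png"))
```

The idea is that you only describe a library of reference diagrams once, and retrieval covers new problems whose diagrams resemble them. It won't read exact labels off a new diagram, though, so it works best combined with OCR for the problem text.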