How to Pass Image-Based Math/Geometry Problems to an LLM Without a Vision Model? (OCR Not Sufficient)

I'm working on a project to build a system that solves math and geometry problems provided as images. The problems will be solved by the LLM DeepSeek R1. However, DeepSeek R1 has no vision capability, and I don't have a separate vision model either, so I need to figure out how to pass these image-based questions to the LLM.

I've considered OCR (Optical Character Recognition) systems, but they don't work well for my case: OCR can extract the printed problem text, but it cannot turn graphical elements (like diagrams or geometric shapes) into a usable text representation. The websites of large language models like DeepSeek often offer an option to upload images. How do those systems work under the hood? If anyone can provide guidance or suggestions, I would greatly appreciate it!
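From what I've read, those upload features generally run a vision model (a captioner or a full vision-language model) in front of the text-only LLM, so the LLM itself only ever receives text. Below is a minimal sketch of that kind of pipeline as I imagine it, assuming the Hugging Face `transformers` package for a small open-source captioning model (BLIP) and the OpenAI-compatible DeepSeek API; the file name, API key, and the hard-coded OCR string are placeholders, not working values.

```python
# Sketch: vision front end -> text -> text-only LLM (DeepSeek R1).
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
from openai import OpenAI

def describe_image(path: str) -> str:
    """Use a small open-source captioning model as a stand-in vision front end."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(out[0], skip_special_tokens=True)

def solve_with_deepseek(description: str, ocr_text: str) -> str:
    """Send the textual reconstruction of the problem to the text-only LLM."""
    client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")
    prompt = (
        "A geometry problem was given as an image.\n"
        f"Diagram description: {description}\n"
        f"Text extracted by OCR: {ocr_text}\n"
        "Solve the problem step by step."
    )
    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # DeepSeek R1
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Placeholder OCR output; in practice this would come from an OCR pass.
print(solve_with_deepseek(describe_image("problem.png"), "Find angle ABC ..."))
```

A generic captioner like this will miss geometric detail (angle labels, point names), which is presumably why the hosted sites use much stronger vision-language models for this step.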


I don't know the details, but I think you could look into something called multimodal RAG (retrieval-augmented generation): you index your diagrams as image embeddings and retrieve matching textual descriptions to pass to the text-only LLM.
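Here is a minimal sketch of what I mean, assuming the `sentence-transformers` package for a CLIP model that embeds images and text into the same space; the corpus of (image, description) pairs is hypothetical and would have to be written by hand once:

```python
# Sketch: multimodal RAG — retrieve hand-written descriptions of
# visually similar diagrams and feed them to the text-only LLM.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # joint image/text embedding space

# Hypothetical corpus: diagrams you have already described by hand.
corpus = [
    ("triangle.png", "Right triangle ABC with the right angle at B."),
    ("circle.png", "Circle with center O and an inscribed angle at P."),
]
corpus_vecs = np.stack([clip.encode(Image.open(path)) for path, _ in corpus])

def retrieve_descriptions(query_image: str, top_k: int = 1) -> list[str]:
    """Return descriptions of the most visually similar known diagrams."""
    q = clip.encode(Image.open(query_image))
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    return [corpus[i][1] for i in best]

# The retrieved text then goes into the DeepSeek R1 prompt as context.
print(retrieve_descriptions("new_problem.png"))
```

The idea is that you only describe a library of reference diagrams once, and retrieval covers new problems whose diagrams resemble them. It won't read exact labels off a new diagram, though, so it works best combined with OCR for the problem text.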