Image-to-text models tailored for web scraping?

Hello everyone 👋! My task is to recognize text on typical layouts: screenshots of product web pages from a supplier's site, from which I extract product names, prices, and specifications. I am using GPT-4 Vision, and it works great. On average, my usage is about 1,000 prompt tokens and 300 completion tokens per page.

I tried a combination of OCR (Tesseract) + LLM, feeding the OCR-recognized text directly to the LLM, but this didn't significantly reduce costs (especially when the language is not English).

My question is: in which direction should I experiment to make this process much more cost-effective without noticeable quality loss? I suspect the most reliable option is to find an image-to-text model on Hugging Face that works reasonably well out of the box and fine-tune it on my data. Or maybe there are existing models tailored for web scraping?
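To see why the OCR + LLM route barely moved the needle, it helps to put rough numbers on it. Below is a minimal sketch, assuming illustrative per-1K-token prices (placeholders, not any provider's actual rates) and the ~1,000 prompt / ~300 completion tokens mentioned above:

```python
# Rough per-page cost comparison. The prices below are illustrative
# placeholders -- substitute your provider's actual rates.

def cost_per_page(prompt_tokens, completion_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Cost of one page in dollars for the given token counts and rates."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

# Token counts from the post: ~1000 prompt, ~300 completion per page.
vision = cost_per_page(1000, 300, price_in_per_1k=0.01, price_out_per_1k=0.03)

# OCR + text-only LLM: OCR output often tokenizes poorly for non-English
# text, so the prompt may not shrink much -- matching the observation above.
ocr_llm = cost_per_page(900, 300, price_in_per_1k=0.01, price_out_per_1k=0.03)

print(f"vision-only: ${vision:.4f}/page, OCR+LLM: ${ocr_llm:.4f}/page")
# -> vision-only: $0.0190/page, OCR+LLM: $0.0180/page
```

Since most of the cost sits in the prompt, the real lever is shrinking (or eliminating) the per-page prompt, which is what a small fine-tuned image-to-text model would do.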


Current recommendations that would work well for this use case include:

Tutorials on fine-tuning are available in my Transformers-Tutorials repo; see e.g. Transformers-Tutorials/PaliGemma at master · NielsRogge/Transformers-Tutorials · GitHub for PaliGemma.
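For a quick feel of what inference with PaliGemma looks like before committing to fine-tuning, here is a minimal sketch using the Transformers API. It assumes the `google/paligemma-3b-mix-224` checkpoint (gated; you need to accept the license on Hugging Face) and uses one of the mix checkpoints' short task prefixes such as `"ocr"`:

```python
def build_prompt(task: str = "ocr") -> str:
    # PaliGemma mix checkpoints are steered by short task prefixes
    # such as "ocr", "caption en", or "answer en <question>".
    return task

def run_paligemma(image_path: str, task: str = "ocr") -> str:
    # Heavy imports kept inside the function so the sketch can be read
    # without downloading the model weights.
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path)
    inputs = processor(text=build_prompt(task), images=image,
                       return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    # Trim the echoed prompt tokens from the decoded output.
    return processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```

The fine-tuning notebook in the repo linked above walks through adapting this to your own (screenshot, structured text) pairs, which is where the real cost savings would come from.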

Besides that, some other powerful models that might work well: