Hi, we are working on a multi-agent tool to automate the recognition of tax receipts. We upload a PDF of e.g. 20 pages and then need to split the pages into individual documents. We assume that all pages of a receipt are grouped together, but their order might be shuffled (e.g. receipt 1 page 1, receipt 1 page 2, receipt 2 page 1, receipt 3 page 2, receipt 2 page 3, receipt 2 page 1, …). After the grouping we need to identify the type of each receipt, e.g. capital income or handyman invoice. In the last step we extract the values relevant for the tax bill, e.g. the individual types of capital income or the individual line items of the handyman invoice. These can then be used for further processing by an LLM.

The whole process works nearly perfectly if we use ChatGPT, but we cannot upload the receipts to a public cloud due to data security concerns. If we use local LLMs, the results, especially for the vision/OCR part, are not that promising, and the computation is very time- and cost-intensive. We are thinking of running a private cloud in a data center, but the costs are really high, as the token count per page can easily exceed 5,000 tokens.

What would be your suggestion? Does it make sense to use open-source Python tools like pdf2image, Tesseract, Tabula or Camelot? What are the biggest risks from your point of view?
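To make the open-source idea concrete, here is a minimal sketch of the pdf2image + Tesseract path we have in mind. The file name, DPI and OCR language are placeholders for this example, not our real setup, and both poppler and the tesseract binary would need to be installed locally.

```python
# Minimal local OCR sketch: render each PDF page to an image with pdf2image
# (requires poppler) and OCR it with pytesseract (requires the tesseract binary).
# "receipts_batch.pdf" and lang="deu" are placeholders for this example.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("receipts_batch.pdf", dpi=300)  # one PIL image per page

page_texts = []
for page_number, page_image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page_image, lang="deu")
    page_texts.append({"page": page_number, "text": text})
    print(f"page {page_number}: {len(text)} characters of OCR text")

# page_texts would then feed the grouping and classification steps described above,
# so that only extracted text (not the raw scans) ever reaches an LLM.
```

Tesseract only gives us raw page text, so the grouping of the shuffled pages and the receipt classification would still have to be built on top of that, e.g. by comparing the text of neighbouring pages.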
The problem is that it needs to be proven that it works! We got some ideas from friends, e.g. YOLO or PaddleOCR. What would be your advice?
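For reference, this is the kind of PaddleOCR call our friends pointed us to (a sketch in the PaddleOCR 2.x style API; the language setting and file name are assumptions, we have not benchmarked this on our receipts yet):

```python
# Rough PaddleOCR sketch (2.x-style API); lang and file name are placeholders.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="german")  # downloads models on first run
result = ocr.ocr("receipt_page_03.png", cls=True)

# result holds one list per input image; each entry is [bounding box, (text, confidence)]
for box, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```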
Regards, Christian