LlamaIndex for PDF parsing

Hi All,
I am developing an application that parses structured tables in PDFs, say into Markdown, so that we can then build a RAG implementation on top. LlamaIndex looks like a powerful framework, but it seems I would need the paid version. The limitation I see is that it takes the document and uploads it to LlamaCloud. I am looking for an option that does not require loading the docs into the cloud and can parse them locally only. Any suggestions on this?

Hi,

It depends on how complex the layout of the PDFs is. If the PDFs are native PDFs (they contain a text layer), then you could use a library like PyPDF. If that's not the case (the PDFs are scanned), then you could use an OCR solution. Tesseract is one open-source option.
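As a rough sketch of that native-vs-scanned decision, everything below runs locally. It assumes the `pypdf`, `pdf2image`, and `pytesseract` packages (plus the Tesseract and Poppler binaries) are installed. The helper names and the `min_chars_per_page` threshold are made up for illustration and are not part of any of those libraries.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text layer of each page with pypdf (no cloud calls)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Image-only pages yield little or no text; normalise to "".
    return [page.extract_text() or "" for page in reader.pages]


def looks_scanned(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption, not a standard): call the PDF scanned
    when more than half of its pages have almost no extractable text."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def ocr_page_texts(pdf_path: str) -> List[str]:
    """Rasterise each page and OCR it with Tesseract."""
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    images = convert_from_path(pdf_path)
    return [pytesseract.image_to_string(img) for img in images]


def pdf_to_page_texts(pdf_path: str) -> List[str]:
    """Use the text layer when it exists, otherwise fall back to OCR."""
    texts = extract_page_texts(pdf_path)
    return ocr_page_texts(pdf_path) if looks_scanned(texts) else texts
```

Calling `pdf_to_page_texts("report.pdf")` (a hypothetical file name) returns one string per page. Note that both paths give plain text, so table structure is not preserved as Markdown; that part still needs a layout-aware tool.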

There are also newer openly available models which can turn PDFs into Markdown, such as:

Thanks @nielsr. What about LlamaIndex?