LlamaIndex for PDF parsing

GauravM17 · August 26, 2024, 7:02am

Hi All,
I am developing an application that parses structured tables in PDFs, say into Markdown, so that we can then build a RAG implementation on top of it. LlamaIndex looks like a powerful framework, but it seems I would need to use the paid version. The limitation I see is that it takes the document and uploads it to LlamaCloud. I am looking for an option that does not require loading the docs into the cloud and can parse them locally only. Any suggestions on this?

nielsr · August 26, 2024, 1:42pm

Hi,

It depends on how complex the layout of the PDFs is. If the PDFs are native PDFs, then you could use a library like PyPDF. If that’s not the case (the PDFs are scanned), then you could use an OCR solution. Open-source solutions include Tesseract.
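For the native-PDF case, a minimal local sketch could look like the following. It assumes the `pypdf` package is installed and uses its `PdfReader` and `extract_text()` calls. The helper names `extract_pages` and `pages_to_markdown` are made up for illustration. Note that this only pulls out plain text page by page; it does not reconstruct table structure, so heavily tabular documents would likely still need one of the layout-aware tools.

```python
def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each page of a native (non-scanned) PDF."""
    # Imported lazily so the helper below works even without pypdf installed.
    from pypdf import PdfReader  # assumption: installed via `pip install pypdf`

    reader = PdfReader(pdf_path)
    # Pages without a text layer (e.g. scans) yield little or no text.
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_markdown(pages: list[str]) -> str:
    """Join per-page text into one Markdown string, one heading per page.

    Hypothetical helper: the "## Page N" layout is just one possible choice.
    """
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    import sys

    # Usage: python parse_pdf.py path/to/file.pdf
    if len(sys.argv) > 1:
        print(pages_to_markdown(extract_pages(sys.argv[1])))
```

Everything here runs on your own machine, so nothing is uploaded anywhere. The resulting Markdown string can then be chunked and indexed by whatever RAG stack you choose.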

There are also newer openly available models which can turn PDFs into Markdown, such as:

Nougat (note that Meta put a non-commercial license on this one)
KOSMOS-2.5 (will soon be part of the Transformers library)
surya (GitHub: VikParuchuri/surya; OCR, layout analysis, reading order, and line detection in 90+ languages), which is great as well

GauravM17 · August 27, 2024, 9:39am

Thanks @nielsr. What about LlamaIndex?
