Darshan Hiranandani : How to Create Datasets from PDF Files?

darshanhiranandani23 · January 16, 2025, 7:44am

Hi everyone,

I’m Darshan Hiranandani, and I’m looking for ways to extract text from PDF files and turn it into a well-structured question-and-answer dataset. Has anyone done this successfully, or built datasets from the text in PDF files?

Any advice, tools, or methods you’ve used for this process would be greatly appreciated!

Thanks in advance!

Regards,
Darshan Hiranandani

John6666 · January 16, 2025, 2:05pm

We’re discussing this very topic on HF Discord, but it’s a bit long to copy and paste.

PDF2Datset

John6666 · January 17, 2025, 1:37am

And here.
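
As a rough starting point, here is a minimal sketch of one possible pipeline, not necessarily what is described in the linked discussions. It assumes the third-party pypdf package for text extraction (any extractor could be swapped in), and the helper names (extract_pages, chunk_text, build_records, write_jsonl) and the record fields are made up for illustration. It only prepares context chunks with empty question/answer slots; filling those in, by hand or with a model, is a separate step.

```python
import json
import re


def extract_pages(pdf_path):
    """Return the text of each page. Assumes pypdf is installed (pip install pypdf)."""
    from pypdf import PdfReader  # third-party; imported lazily so the rest runs without it

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def chunk_text(text, max_chars=1000):
    """Group paragraphs into chunks of about max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", text)]
    chunks, current = [], ""
    for para in paragraphs:
        if not para:
            continue
        if current and len(current) + 1 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks


def build_records(chunks, source):
    """Wrap each chunk as a record with empty question/answer fields (hypothetical schema)."""
    return [
        {"source": source, "chunk_id": i, "context": chunk, "question": "", "answer": ""}
        for i, chunk in enumerate(chunks)
    ]


def write_jsonl(records, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Typical use would be something like joining the output of extract_pages("your_file.pdf") with blank lines, passing it through chunk_text and build_records, and saving with write_jsonl. The resulting JSONL file can then be loaded with the datasets library via load_dataset("json", data_files=...). Scanned PDFs without a text layer would need OCR first, which this sketch does not cover.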

