I've built a PDF-to-dataset tool with Python and would love your feedback

ayabongaqwabi · August 3, 2025, 6:59pm

Hey guys, I’m thrilled to share a Python tool I’ve developed that converts history books (PDFs) into high-quality Q&A datasets for AI training. It uses Ollama for local model inference, includes PDF text extraction, historical content filtering, and deduplication, and can be customized for any PDF book!
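To give a rough idea of what the filtering and deduplication steps involve, here is a minimal sketch using only the standard library. The function names, the keyword list, and the hash-based exact-match deduplication are illustrative assumptions, not necessarily how the repo implements them.

```python
import hashlib
import re

# Hypothetical keyword list for illustration; swap in your own domain terms.
KEYWORDS = {"empire", "treaty", "war", "colony"}


def is_relevant(chunk, keywords=KEYWORDS, min_hits=1):
    """Keep a text chunk only if it mentions enough of the keywords."""
    words = set(re.findall(r"[a-z]+", chunk.lower()))
    return len(words & set(keywords)) >= min_hits


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(chunks):
    """Drop chunks whose normalized text was already seen, keeping order."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def filter_chunks(chunks, keywords=KEYWORDS):
    """Keyword filter first, then deduplicate what is left."""
    return deduplicate([c for c in chunks if is_relevant(c, keywords)])
```

This only catches exact duplicates after normalization; near-duplicates (for example, the same passage with a page header mixed in) would need a fuzzier comparison.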

Key features:

AI-powered Q&A generation with models like Llama 3.1 or Mistral
Customizable keywords for domain-specific content
Parallel processing and resume capability for efficiency (CPU only for now; I didn't know how to make Python use the GPU)
JSONL output format for easy integration
Check out the full details in this GitHub repo.
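For anyone curious how JSONL output and resume capability can fit together, here is a small sketch. It assumes each record carries a `chunk_id` field and that `generate_qa` is a stand-in for whatever calls the model; both are hypothetical names for illustration and may differ from what the repo does.

```python
import json
import os


def load_done_ids(path):
    """Return the set of chunk ids already written to the JSONL file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line)["chunk_id"])
                except (json.JSONDecodeError, KeyError):
                    # Skip a line that was cut off by an interrupted run.
                    continue
    return done


def append_record(path, chunk_id, question, answer):
    """Append one Q&A pair as a single JSON object on its own line."""
    record = {"chunk_id": chunk_id, "question": question, "answer": answer}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def run(chunks, path, generate_qa):
    """Process chunks in order, skipping any that are already in the output."""
    done = load_done_ids(path)
    for chunk_id, chunk in enumerate(chunks):
        if chunk_id in done:
            continue
        question, answer = generate_qa(chunk)
        append_record(path, chunk_id, question, answer)
```

Because each record is appended as soon as it is generated, an interrupted run loses at most the chunk that was in progress, and rerunning `run` with the same output path picks up where it stopped.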

I’d love feedback, suggestions, or contributions to make this tool even better!
