I've built a PDF-to-dataset tool with Python and would love your feedback

Hey guys, I’m thrilled to share a Python tool I’ve developed that converts history books (PDFs) into high-quality Q&A datasets for AI training. It uses Ollama for local model inference, includes PDF text extraction, historical content filtering, and deduplication, and can be customized for any PDF book!

Key features:

  • AI-powered Q&A generation with models like Llama 3.1 or Mistral
  • Customizable keywords for domain-specific content
  • Parallel processing and resume capability for efficiency (CPU only for now; I didn’t know how to make Python use the GPU)
  • JSONL output format for easy integration

Check out the full details on this GitHub repo.
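
For anyone wondering what the keyword filtering, deduplication, and JSONL steps look like in general, here is a minimal sketch. It is not the code from the repo: the keyword list, the function names (`is_relevant`, `dedupe`, `to_jsonl`), and the `question`/`answer` field names are all assumptions made up for illustration.

```python
import hashlib
import json

# Hypothetical keyword list; in a real tool this would be user-configurable.
KEYWORDS = {"empire", "war", "treaty", "dynasty"}


def is_relevant(text, keywords=KEYWORDS):
    """Keep text only if it mentions at least one keyword (simple substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def normalize(text):
    """Lowercase and collapse whitespace so near-identical questions compare equal."""
    return " ".join(text.lower().split())


def dedupe(pairs):
    """Drop pairs whose normalized question was already seen, keeping the first one."""
    seen = set()
    unique = []
    for pair in pairs:
        key = hashlib.sha256(normalize(pair["question"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def to_jsonl(pairs):
    """Serialize one JSON object per line (the JSONL format)."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


pairs = [
    {"question": "When was the Treaty of Westphalia signed?", "answer": "1648"},
    {"question": "when was the  treaty of westphalia signed?", "answer": "In 1648."},
    {"question": "What is a good pasta recipe?", "answer": "Boil the pasta first."},
]

relevant = [p for p in pairs if is_relevant(p["question"])]
unique = dedupe(relevant)
print(to_jsonl(unique))
```

Substring matching is crude (for example, "war" also matches "toward"), so a word-boundary check or a smarter filter would likely work better on real book text.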

I’d love feedback, suggestions, or contributions to make this tool even better!
