Fine-Tuning a Language Model with Data Extracted from Multiple PDFs for a Chat Interface

Hi everyone,

I’m embarking on a project where I aim to fine-tune a language model (LM) using data extracted from multiple PDF documents. My goal is to create an interactive chatbot that can understand queries and return relevant information directly from the content contained within these PDFs.

Before diving deep, I wanted to ask the community a few questions:

  1. Has anyone here successfully fine-tuned a large language model (LLM) with data extracted from PDFs? I’m interested in any challenges you might have faced, especially regarding data preprocessing and format conversion.
  2. Is there a notebook or tutorial available that outlines a simple approach to fine-tuning an LLM with PDF-derived data? Ideally, this guide would cover the entire process from text extraction to model fine-tuning and deployment.
  3. For the chat interface, I’m curious about best practices for integrating the fine-tuned model so it can efficiently search through and reference the PDF content in response to user queries. Any advice or examples would be greatly appreciated.

I’m hoping to leverage the capabilities of Hugging Face’s transformers for this project. However, I’m open to suggestions on other tools or methods that might be well-suited for this kind of application.
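
To make the question more concrete, here is a rough sketch of the preprocessing and lookup steps I have in mind. It assumes the text has already been extracted from the PDFs (the extraction step is not shown). The function names `chunk_text`, `to_jsonl`, and `top_chunks`, the chunk sizes, and the `source`/`text` record layout are placeholders of my own, not anything prescribed by transformers. The word-overlap ranking is only a naive stand-in for a proper search method.

```python
import json
import re


def chunk_text(text, max_words=200, overlap=50):
    """Split already-extracted text into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def to_jsonl(chunks, source):
    """Serialize chunks as JSON Lines, one record per chunk.

    The {"source", "text"} layout is a hypothetical choice, not a required format.
    """
    return "\n".join(json.dumps({"source": source, "text": c}) for c in chunks)


def top_chunks(query, chunks, k=3):
    """Rank chunks by how many distinct query words they contain.

    This is a naive placeholder for a real retrieval method.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for chunk in chunks:
        chunk_words = set(re.findall(r"\w+", chunk.lower()))
        scored.append((len(query_words & chunk_words), chunk))
    # sort() is stable, so chunks with equal scores keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]
```

The idea would be to write the JSONL records out as fine-tuning data, and to use something like `top_chunks` at chat time to pick the passages the model should reference. I'm unsure whether this split is the right one, which is part of what I'm asking.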

Thank you in advance for your insights and assistance!

Best regards,


Hello, I’m interested in this same task. If you get any good answers, could you share them with me too?

I posted my questions in this thread: Chatbot PDF - Only local


Hi, did you build such a model, and how did you use the PDF data? I’m especially interested in cases where the PDFs contain tabular data embedded as images.