Fine-Tuning a Language Model with Data Extracted from Multiple PDFs for a Chat Interface

Hi everyone,

I’m embarking on a project where I aim to fine-tune a language model (LM) using data extracted from multiple PDF documents. My goal is to build an interactive chatbot that can understand queries and answer them with relevant information drawn directly from the content of these PDFs.

Before diving deep, I wanted to ask the community a few questions:

  1. Has anyone here successfully fine-tuned a large language model (LLM) with data extracted from PDFs? I’m interested in any challenges you might have faced, especially regarding data preprocessing and format conversion.
  2. Is there a notebook or tutorial available that outlines a simple approach to fine-tuning an LLM with PDF-derived data? Ideally, this guide would cover the entire process from text extraction to model fine-tuning and deployment.
  3. For the chat interface, I’m curious about best practices for integrating the fine-tuned model so it can efficiently search through and reference the PDF content in response to user queries. Any advice or examples would be greatly appreciated.

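Regarding the preprocessing question (point 1), a common first step is to extract the raw text from each PDF (e.g. with `pypdf`'s `PdfReader(...).pages[i].extract_text()`) and then split it into overlapping chunks before building a fine-tuning or retrieval dataset. Here is a minimal sketch of the chunking step in pure Python; the chunk size and overlap values are illustrative assumptions, not recommendations from any specific guide:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split raw text (e.g. extracted from a PDF page) into word-based
    chunks of `chunk_size` words, where each chunk shares `overlap`
    words with the previous one to preserve context across boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        # Stop once the final words have been covered by a chunk.
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: 4-word chunks with a 2-word overlap.
chunk_text("one two three four five six", chunk_size=4, overlap=2)
# → ["one two three four", "three four five six"]
```

Each chunk can then be written out as one record of a JSONL training file, or indexed for retrieval at query time.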
I’m hoping to leverage the capabilities of Hugging Face’s transformers for this project. However, I’m open to suggestions on other tools or methods that might be well-suited for this kind of application.
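On point 3, one pattern worth knowing about is retrieval-augmented generation: rather than relying on fine-tuning alone, the chat interface retrieves the most relevant PDF chunks at query time and passes them to the model as context. The sketch below uses a toy word-overlap score just to make the retrieval step concrete; in practice an embedding model (e.g. from `sentence-transformers`) would replace it, which is my assumption, not a requirement of the transformers library:

```python
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever: return the top_k chunks that share the most
    words with the query. Stands in for an embedding-based search."""
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

# Example: the chat loop would prepend the retrieved chunks to the
# user's question before sending it to the (fine-tuned) model.
docs = ["the cat sat on the mat", "stock prices fell today", "a cat and a dog"]
retrieve("where is the cat", docs, top_k=1)
# → ["the cat sat on the mat"]
```

This keeps answers grounded in the PDFs and avoids retraining every time the documents change.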

Thank you in advance for your insights and assistance!

Best regards,


Hello, I’m interested in this same learning task. If you get good answers, could you share them with me too?

I posted my questions here: Chatbot PDF - Only local
