Fine-tune LLMs on PDF Documents

We are currently seeking assistance with fine-tuning the Mistral model on approximately 48 PDF documents. Specifically, our challenge lies in training the model with PEFT and preparing the documents for optimal fine-tuning. We are having difficulty locating suitable resources for this task, and we are also uncertain about the proper procedures for preparing, storing, and supplying the documents.

If anyone within the community has expertise in this area or can provide guidance on the aforementioned aspects, we would greatly appreciate your assistance. Your insights and recommendations would be invaluable to our project.


I assume you want to extract raw text from the PDFs? In what form do you want the data for fine-tuning to be?

Here’s a link to one Jupyter notebook from our pipeline for experiments on fine-tuning OpenAI models based on PDFs and bibliographic ground-truth metadata; it uses PyMuPDF (imported under the name fitz) for text extraction.
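
For a quick idea of what that extraction step looks like, here’s a minimal sketch with PyMuPDF; the file path is a placeholder, and the notebook linked above does considerably more cleanup and metadata handling:

```python
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    """Concatenate the raw text of every page in a PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

# Hypothetical path; point this at one of your own PDFs.
print(pdf_to_text("paper.pdf")[:500])
```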


Thank you for your response.

We aim to customize LLMs for a specific domain by fine-tuning them on approximately 50 books. This process should enhance the model’s understanding of the domain’s nuances and potentially expand its vocabulary. However, my team and I are unsure how to effectively store and process these PDFs for the LLM, as existing online resources primarily cover instruction fine-tuning and related methods. Any help and guidance would be deeply appreciated.

Hello there @imvbhuvan, were you able to fine-tune the model using PDFs (unstructured data, I assume)? I am facing similar challenges: I have some PDFs and HTML website data, and they lack a formatted structure. The goal is to fine-tune the model so it has the ability to understand the domain.

Thank you very much for the reply. I will email you shortly!!

Hi, I’m also trying to fine-tune Mistral on some documents. Specifically, I have a text file extracted from a 1–5 page PDF, which will be the context, some questions about it, and another text file with a longer, structured form of the answer (CSV output). How did you create the dataset?
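
(For anyone with a similar layout, here is a minimal sketch of assembling such files into training records; the directory layout, file naming scheme, and one-question-per-line format are my assumptions about how the data might be organized:)

```python
import json
from pathlib import Path

# Assumed layout: data/context_001.txt holds the extracted PDF text,
# data/questions_001.txt holds one question per line, and
# data/answer_001.txt holds the structured (CSV-style) answer.
records = []
for ctx_file in sorted(Path("data").glob("context_*.txt")):
    idx = ctx_file.stem.split("_")[1]
    context = ctx_file.read_text()
    answer = Path(f"data/answer_{idx}.txt").read_text()
    for question in Path(f"data/questions_{idx}.txt").read_text().splitlines():
        records.append({"context": context, "question": question, "answer": answer})

# One JSON record per line, ready to load with the datasets library.
with open("train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```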

Did you find a way to do it?

There are two options here, I’d say:

  • Either you fine-tune a text-only LLM (like Mistral, LLaMa, etc.) on the OCR text of the PDF along with a text prompt and corresponding targets. Refer to my notebook on fine-tuning Mistral-7B (or any other LLM) on a custom dataset: Transformers-Tutorials/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub. It leverages QLoRA with the PEFT library and is based on the official scripts from the Alignment Handbook: GitHub - huggingface/alignment-handbook: Robust recipes to align language models with human and AI preferences. Basically, one needs to prepare the dataset in the format the model expects, by calling tokenizer.apply_chat_template on the inputs; see the sketch after this list. What I’d recommend is applying an OCR engine of your choice to the PDF documents (such as Tesseract, or closed-source APIs like the ones from Google Cloud or Azure).
  • Or you fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a desired target. One example would be to take in a PDF and produce corresponding JSON containing the desired keys and values from the PDF. Refer to the example notebook on fine-tuning PaliGemma or the example notebook on fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you could easily fine-tune it on PDFs that consist of multiple pages.
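
To make the first option concrete, here’s a minimal sketch of the dataset formatting plus QLoRA setup, loosely in the spirit of the linked notebook. The model ID, hyperparameters, and record fields are illustrative assumptions, and depending on your trl version some of these SFTTrainer arguments may live on SFTConfig instead:

```python
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any model with a chat template works
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Turn (OCR text, prompt, target) records into the chat format the model expects.
def to_chat(example):
    messages = [
        {"role": "user", "content": f"{example['ocr_text']}\n\n{example['prompt']}"},
        {"role": "assistant", "content": example["target"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

raw = Dataset.from_list([
    # Replace with your own OCR'ed pages and targets.
    {"ocr_text": "Invoice 42 ... Total: $1,234.56",
     "prompt": "What is the total?", "target": "$1,234.56"},
])
train_ds = raw.map(to_chat)

# 4-bit quantization: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    peft_config=peft_config,
    args=TrainingArguments(output_dir="mistral-pdf-sft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, learning_rate=2e-4,
                           num_train_epochs=1, bf16=True, logging_steps=10),
)
trainer.train()
```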

The latter approach boils down to rendering each page of the PDF as an image and training the model to map those page images (plus an optional text prompt) to the desired target text.

Update: more and more VLMs now support fine-tuning on multi-page documents. Some recent examples include Qwen2-VL and LLaVa-OneVision.
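
If you go the vision-language route, splitting a PDF into one image per page can be done with PyMuPDF; a minimal sketch (the DPI value and the file name are assumptions, and the resulting list of images plus your prompt goes to the processor of whichever VLM you pick):

```python
import fitz  # PyMuPDF
from PIL import Image

def pdf_pages_to_images(path: str, dpi: int = 144) -> list[Image.Image]:
    """Render each page of a PDF to a PIL image for a vision-language model."""
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB pixmap, no alpha by default
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images

# Hypothetical file; one image per page, all fed to the model together.
pages = pdf_pages_to_images("report.pdf")
```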


Make your own dataset and train on it. I’m facing some issues with excessive replies: the model tends to give duplicated replies, but with a larger dataset we can ensure improved performance. Convert all the data to text using the -layout option of pdftotext, then fine-tune it using AutoTrain.
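
(A minimal sketch of that conversion step, driving poppler’s pdftotext from Python; the pdfs/ directory is a placeholder:)

```python
import subprocess
from pathlib import Path

# Convert every PDF in pdfs/ to layout-preserving plain text.
for pdf in Path("pdfs").glob("*.pdf"):
    txt = pdf.with_suffix(".txt")
    subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
```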

Thank you for this information !

In general, when you talk about PDF to JSON, what we find in the notebook is image to JSON (it’s as if each PDF were converted to a single image).

But in my case, I have a PDF with multiple pages. The PDF is divided into paragraphs, each with its own properties. The problem is that if I convert the PDF into a single image, it risks losing information, since a paragraph might span two images.

My goal is to convert the entire PDF into one JSON file.

So, I wonder if it is possible to divide a PDF into multiple images that will be given as input, and the output will be a single JSON file. If yes, how?

Hi Bhuvan, were you able to succeed and get good inference results on the same task? I am also researching this and have not found a great way forward yet. Can you help me with it?

Hi Sabber, I also face a similar problem: I have some hundreds of PDFs on which I want to train/fine-tune an LLM. The thing is, when I train it on raw text (PDFs converted to text), the data is not well formatted, and hence the fine-tuned LLM does not work correctly. Also, the PDFs have tables, and PyMuPDF, tabula, etc. are not able to extract the tabular data from them. Please help if you know a solution.
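
(Not a full answer, but one alternative worth trying for the table problem is pdfplumber; a minimal sketch, with the file path as a placeholder:)

```python
import pdfplumber

# Extract tables page by page; each table comes back as a list of rows.
with pdfplumber.open("document.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"Table on page {page_number}:")
            for row in table:
                print(row)
```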

