Fine-tune LLMs on PDF Documents

We are currently seeking assistance with fine-tuning the Mistral model on approximately 48 PDF documents. Specifically, our challenge lies in training the model with PEFT and preparing the documents for optimal fine-tuning. We are having difficulty locating suitable resources for this task, and we are also uncertain about the proper procedures for preparing, storing, and supplying the documents.

If anyone within the community has expertise in this area or can provide guidance on the aforementioned aspects, we would greatly appreciate your assistance. Your insights and recommendations would be invaluable to our project.


I assume you want to extract raw text from the PDFs? In what form do you want the data for fine-tuning to be?

Here’s a link to one Jupyter notebook from our pipeline for experiments on fine-tuning OpenAI models based on PDFs and bibliographic ground-truth metadata; it uses PyMuPDF (imported under the name fitz) for text extraction.
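
For a quick idea of what that extraction step looks like, here’s a minimal sketch with PyMuPDF; the file path is a placeholder, and the notebook linked above does considerably more cleanup and metadata handling:

```python
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    """Concatenate the raw text of every page in a PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

# Hypothetical path; point this at one of your own PDFs.
print(pdf_to_text("paper.pdf")[:500])
```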


Thank you for your response.

We aim to customize LLMs for a specific domain by fine-tuning them on approximately 50 books. This process should enhance the model’s understanding of the domain’s nuances and potentially expand its vocabulary. However, my team and I are unsure how to effectively store and process these PDFs for the LLM, as existing online resources primarily cover instruction fine-tuning and related methods. Any help and guidance would be deeply appreciated.

Hello there @imvbhuvan, were you able to fine-tune the model using PDFs (unstructured data, I assume)? I am facing similar challenges: I have some PDFs and HTML website data, and they lack a formatted structure. The goal is to fine-tune the model so it has the ability to understand the domain.

Thank you very much for the reply. I will email you shortly!!

Hi, I’m also trying to fine-tune Mistral on some documents. Specifically, I have a text file extracted from a 1–5 page PDF, which will be the context, some questions about it, and another text file with a longer, structured form of the answer (CSV output). How did you create the dataset?
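
(For anyone with a similar layout, here is a minimal sketch of assembling such files into training records; the directory layout, file naming scheme, and one-question-per-line format are my assumptions about how the data might be organized:)

```python
import json
from pathlib import Path

# Assumed layout: data/context_001.txt holds the extracted PDF text,
# data/questions_001.txt holds one question per line, and
# data/answer_001.txt holds the structured (CSV-style) answer.
records = []
for ctx_file in sorted(Path("data").glob("context_*.txt")):
    idx = ctx_file.stem.split("_")[1]
    context = ctx_file.read_text()
    answer = Path(f"data/answer_{idx}.txt").read_text()
    for question in Path(f"data/questions_{idx}.txt").read_text().splitlines():
        records.append({"context": context, "question": question, "answer": answer})

# One JSON record per line, ready to load with the datasets library.
with open("train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```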

Did you find a way to do it?

There are two options here, I’d say:

  • Either you fine-tune a text-only LLM (like Mistral, LLaMa, etc.) on the OCR text of the PDF along with a text prompt and corresponding targets. Refer to my notebook on fine-tuning Mistral-7B (or any other LLM) on a custom dataset: Transformers-Tutorials/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub. It leverages QLoRA with the PEFT library and is based on the official scripts from the Alignment Handbook: GitHub - huggingface/alignment-handbook: Robust recipes to align language models with human and AI preferences. Basically, one needs to prepare the dataset in the format the model expects, by calling tokenizer.apply_chat_template on the inputs; see the sketch after this list. What I’d recommend is applying an OCR engine of your choice to the PDF documents (such as Tesseract, or closed-source APIs like the ones from Google Cloud or Azure).
  • Or you fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a desired target. One example would be to take in a PDF and produce corresponding JSON containing the desired keys and values from the PDF. Refer to the example notebook on fine-tuning PaliGemma or the example notebook on fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you could easily fine-tune it on PDFs that consist of multiple pages.
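
To make the first option concrete, here’s a minimal sketch of the dataset formatting plus QLoRA setup, loosely in the spirit of the linked notebook. The model ID, hyperparameters, and record fields are illustrative assumptions, and depending on your trl version some of these SFTTrainer arguments may live on SFTConfig instead:

```python
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any model with a chat template works
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Turn (OCR text, prompt, target) records into the chat format the model expects.
def to_chat(example):
    messages = [
        {"role": "user", "content": f"{example['ocr_text']}\n\n{example['prompt']}"},
        {"role": "assistant", "content": example["target"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

raw = Dataset.from_list([
    # Replace with your own OCR'ed pages and targets.
    {"ocr_text": "Invoice 42 ... Total: $1,234.56",
     "prompt": "What is the total?", "target": "$1,234.56"},
])
train_ds = raw.map(to_chat)

# 4-bit quantization: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    peft_config=peft_config,
    args=TrainingArguments(output_dir="mistral-pdf-sft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, learning_rate=2e-4,
                           num_train_epochs=1, bf16=True, logging_steps=10),
)
trainer.train()
```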

The latter approach boils down to rendering each page of the PDF as an image and training the model to map those page images (plus an optional text prompt) to the desired target text.

Update: more and more VLMs now support fine-tuning on multi-page documents. Some recent examples include Qwen2-VL and LLaVa-OneVision.
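
If you go the vision-language route, splitting a PDF into one image per page can be done with PyMuPDF; a minimal sketch (the DPI value and the file name are assumptions, and the resulting list of images plus your prompt goes to the processor of whichever VLM you pick):

```python
import fitz  # PyMuPDF
from PIL import Image

def pdf_pages_to_images(path: str, dpi: int = 144) -> list[Image.Image]:
    """Render each page of a PDF to a PIL image for a vision-language model."""
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB pixmap, no alpha by default
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images

# Hypothetical file; one image per page, all fed to the model together.
pages = pdf_pages_to_images("report.pdf")
```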


Make your own dataset and train on it. I’m facing some issues with excessive replies: the model tends to give duplicated replies, but with a larger dataset we can ensure improved performance. Convert all the data to text using the -layout option of pdftotext, then fine-tune it using AutoTrain.
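
(A minimal sketch of that conversion step, driving poppler’s pdftotext from Python; the pdfs/ directory is a placeholder:)

```python
import subprocess
from pathlib import Path

# Convert every PDF in pdfs/ to layout-preserving plain text.
for pdf in Path("pdfs").glob("*.pdf"):
    txt = pdf.with_suffix(".txt")
    subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
```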

Thank you for this information !

In general, when you talk about PDF to JSON, what we find in the notebook is image to JSON (it’s as if each PDF were converted to a single image).

But in my case, I have a PDF with multiple pages. The PDF is divided into paragraphs, each with its own properties. The problem is that if I convert the PDF into a single image, it risks losing information, since a paragraph might span two images.

My goal is to convert the entire PDF into one JSON file.

So, I wonder if it is possible to divide a PDF into multiple images that will be given as input, and the output will be a single JSON file. If yes, how?

Hi Bhuvan, were you able to succeed and get good inference results on the same task? I am also researching this and have not found a great way forward yet. Can you help me with it?

Hi Sabber, I also face a similar problem: I have some hundreds of PDFs on which I want to train/fine-tune an LLM. The thing is, when I train it on raw text (PDFs converted to text), the data is not well formatted, and hence the fine-tuned LLM does not work correctly. Also, the PDFs have tables, and PyMuPDF, tabula, etc. are not able to extract the tabular data from them. Please help if you know a solution.
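
(Not a full answer, but one alternative worth trying for the table problem is pdfplumber; a minimal sketch, with the file path as a placeholder:)

```python
import pdfplumber

# Extract tables page by page; each table comes back as a list of rows.
with pdfplumber.open("document.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"Table on page {page_number}:")
            for row in table:
                print(row)
```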

