Fine tune LLMs on PDF Documents

We are currently seeking assistance in fine-tuning the Mistral model using approximately 48 PDF documents. Specifically, our challenge lies in training the model using peft and preparing the documents for optimal fine-tuning. We are facing difficulties in locating suitable resources for this task, and we are also uncertain about the proper procedures for document preparation, storage, and supply.

If anyone within the community has expertise in this area or can provide guidance on the aforementioned aspects, we would greatly appreciate your assistance. Your insights and recommendations would be invaluable to our project.

I assume you want to extract raw text from the PDFs? In what form do you want the fine-tuning data to be?

Here’s a link to one Jupyter notebook of our pipeline for experiments to fine-tune OpenAI models based on PDFs and bibliographic ground-truth metadata; it uses PyMuPDF for text extraction (imported with the name fitz).
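For readers without access to the notebook, here is a minimal sketch of PDF text extraction with PyMuPDF. It is not the notebook's code: the function names and the cleanup rules (rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions chosen for illustration, and real documents usually need more cleanup than this (headers, footers, tables).

```python
import re


def clean_page_text(text: str) -> str:
    """Heuristic cleanup for one page of extracted text (illustrative rules only)."""
    # Rejoin words split across a line break, e.g. "docu-\nment" -> "document".
    # This also removes genuine hyphens that happen to fall at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_pdf_text(pdf_path: str) -> str:
    """Return the cleaned text of every page in a PDF, joined by blank lines."""
    import fitz  # PyMuPDF; imported here so clean_page_text works without it

    with fitz.open(pdf_path) as doc:
        pages = [clean_page_text(page.get_text()) for page in doc]
    # Skip pages that yielded no text (e.g. scanned images without OCR).
    return "\n\n".join(p for p in pages if p)
```

Calling extract_pdf_text("book.pdf") (a hypothetical file name) would give one string per document, which can then be chunked or paired with metadata depending on the fine-tuning format you settle on.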

Thank you for your response.

We aim to customize the LLMs for a specific domain by fine-tuning them on approximately 50 books. The goal is to improve the model’s understanding of the domain’s nuances and potentially expand its vocabulary. However, my team and I lack knowledge on how to effectively store and process these PDFs for the LLM, as existing online resources primarily discuss instruction fine-tuning and other methods. Any help and guidance would be greatly appreciated.
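One common way to prepare raw book text for this kind of domain adaptation is to split it into overlapping chunks and store them as JSON Lines, one chunk per line, for training with a plain next-token (causal language modeling) objective instead of instruction pairs. The sketch below assumes that approach; the chunk size, overlap, word-based splitting, and the "text" field name are illustrative choices, not requirements of any particular trainer.

```python
import json


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def write_jsonl(chunks: list[str], out_path: str) -> None:
    """Write one JSON object per line, each with a single "text" field."""
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
```

Splitting on words is only a rough proxy for token counts; if the chunks must fit a specific context length, measuring them with the model's own tokenizer would be the safer choice.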

Hello there @imvbhuvan, were you able to fine-tune the model using PDFs (I assume unstructured data)? I am facing similar challenges. I have some PDFs and HTML website data, none of it in a formatted structure, but the goal is to fine-tune the model so that it understands the domain.

Thank you very much for the reply. I will email you shortly!!

Hi. I’m also trying to fine-tune Mistral on some documents. In my case, the context is a text file extracted from a 1-5 page PDF, followed by some questions about it, and a second text file holds a longer, structured answer (CSV output). How did you create the dataset?
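For a setup like this (context, question, structured answer), one possible dataset shape is a prompt/completion pair per question, stored as JSON Lines. The sketch below is only an illustration: the prompt template, the "prompt" and "completion" field names, and the helper names are assumptions, and the fields should be renamed to whatever your training tool expects.

```python
import json

# Hypothetical prompt template; adjust the wording and markers to your model.
PROMPT_TEMPLATE = (
    "### Context:\n{context}\n\n"
    "### Question:\n{question}\n\n"
    "### Answer (CSV):\n"
)


def build_example(context: str, question: str, answer_csv: str) -> dict:
    """Combine one context, one question and its CSV answer into a record."""
    return {
        "prompt": PROMPT_TEMPLATE.format(
            context=context.strip(), question=question.strip()
        ),
        "completion": answer_csv.strip(),
    }


def to_jsonl(examples: list[dict]) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

Each question about the same document would get its own record, with the document text repeated as context.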

Did you find a way to do it?

There are 2 options here, I’d say:

1. Make your own dataset and train on it. I’m running into some issues with excessive replies: the model tends to repeat its reply. A larger dataset should improve performance, though.

2. Convert all the data to text using the -layout option of pdftotext, and then fine-tune on it using AutoTrain.
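The conversion step can be scripted. The sketch below assumes the tool meant above is Poppler's pdftotext command-line utility, whose -layout flag keeps the original physical layout of the page, and that it is installed and on the PATH. The folder names and helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path) -> list[str]:
    """Build the pdftotext call that preserves the page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir: str, out_dir: str) -> list[Path]:
    """Convert every *.pdf in pdf_dir to a .txt file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = out / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf, txt), check=True)
        written.append(txt)
    return written
```

For example, convert_folder("pdfs", "texts") would produce one text file per PDF, ready to be cleaned and turned into a training dataset.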