We are currently seeking assistance with fine-tuning the Mistral model on approximately 48 PDF documents. Specifically, our challenge lies in training the model with peft and preparing the documents for effective fine-tuning. We are having difficulty locating suitable resources for this task, and we are also unsure of the proper procedures for preparing, storing, and supplying the documents to the model.
If anyone within the community has expertise in this area or can provide guidance on the aforementioned aspects, we would greatly appreciate your assistance. Your insights and recommendations would be invaluable to our project.
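(For anyone following along: here is a minimal sketch of attaching LoRA adapters to Mistral with peft, which is the usual starting point for this kind of task. The checkpoint name and hyperparameters below are assumptions, not anything confirmed in this thread.)

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Wrap the base model with LoRA adapters so only a small set of
# low-rank matrices is trained instead of all of the base weights.
lora_config = LoraConfig(
    r=16,                     # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model can then be passed to a standard transformers Trainer
# together with a tokenized dataset built from the PDF text.
```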
I assume you want to extract raw text from the PDFs? In what form do you want the fine-tuning data to be?
Here’s a link to one Jupyter notebook from our pipeline for experiments on fine-tuning OpenAI models based on PDFs and bibliographic ground-truth metadata; it uses PyMuPDF for text extraction (imported under the name fitz).
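For reference, a minimal sketch of the kind of PyMuPDF extraction such a pipeline relies on (the helper name and file path here are illustrative, not from the notebook):

```python
import fitz  # PyMuPDF is imported under the name "fitz"

def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in a PDF."""
    doc = fitz.open(pdf_path)
    pages = [page.get_text("text") for page in doc]
    doc.close()
    return "\n".join(pages)

print(extract_text("example.pdf")[:500])  # quick sanity check
```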
We aim to customize the LLMs for a specific domain by fine-tuning them on approximately 50 books. This process will enhance the model’s understanding of the domain’s nuances and potentially expand its vocabulary. However, my team and I lack knowledge of how to effectively store and process these PDFs for the LLM, as existing online resources primarily discuss instruction fine-tuning and other methods. Any help and guidance will be deeply appreciated.
Hello there @imvbhuvan, were you able to fine-tune the model using PDFs (unstructured data, I assume)? I am facing similar challenges. I have some PDFs and HTML website data without any formatted structure, but the goal is to fine-tune the model so it has the ability to understand the domain.
Hi. I’m also trying to fine-tune Mistral on some documents. Specifically, the context is a text file extracted from a 1–5 page PDF, followed by some questions about it, plus another text file with a longer, structured form of the answer (CSV output). How did you create the dataset?
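One way to assemble files like that into a fine-tuning dataset is a simple JSONL of prompt/completion pairs. Below is a hedged sketch; the directory layout, prompt template, and placeholder question are all assumptions:

```python
import json
from pathlib import Path

records = []
for ctx_file in Path("contexts").glob("*.txt"):   # one context txt per PDF (assumed layout)
    ans_file = Path("answers") / ctx_file.name    # matching structured-answer file (assumed)
    context = ctx_file.read_text(encoding="utf-8")
    answer = ans_file.read_text(encoding="utf-8")
    question = "Extract the records as CSV."      # placeholder question
    records.append({
        "prompt": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:",
        "completion": answer,
    })

# Write one JSON object per line, the format most SFT tooling expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```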
One option is to fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, or PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a certain desired target. One example would be taking in a PDF and producing corresponding JSON containing the desired keys and values from the PDF. Refer to the example notebook on fine-tuning PaliGemma or the example notebook on fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you could easily fine-tune it on PDFs that consist of multiple pages.
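As a rough sketch of what a single training example could look like for Idefics2 (the checkpoint is real, but the invoice.pdf path, the prompt, and the target JSON are illustrative assumptions):

```python
from pdf2image import convert_from_path  # requires poppler to be installed
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

pages = convert_from_path("invoice.pdf", dpi=150)        # one PIL image per page
target_json = '{"total": "42.00", "currency": "EUR"}'    # assumed desired output

# A chat-format example: user turn holds the page image and instruction,
# assistant turn holds the JSON the model should learn to produce.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the total and currency as JSON."},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": target_json}]},
]
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[pages[0]], return_tensors="pt")
# inputs["input_ids"] / inputs["pixel_values"] can then be fed to
# Idefics2ForConditionalGeneration inside a standard training loop.
```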
Make your own dataset and train on it. I’m facing some issues with excessive replies: the model tends to duplicate its answers, but with a larger dataset we can ensure improved performance. Convert all the data to text using the -layout option of pdftotext and then fine-tune with AutoTrain.
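A small sketch of that conversion step, driving pdftotext from Python (pdftotext itself ships with poppler; the directory names are assumptions):

```python
import subprocess
from pathlib import Path

for pdf in Path("pdfs").glob("*.pdf"):
    out = pdf.with_suffix(".txt")
    # -layout tells pdftotext to preserve the physical layout of each page,
    # which keeps columns and tables readable in the extracted text.
    subprocess.run(["pdftotext", "-layout", str(pdf), str(out)], check=True)
```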
In general, when you talk about PDF to JSON, what we find in the notebook is image to JSON (it is as if the PDF were first converted to a single image).
But in my case, I have a PDF with multiple pages. The PDF is divided into paragraphs, each with its own properties. The problem is that if I convert the PDF page by page into images, I risk losing information, since a paragraph might span two images.
My goal is to convert the entire PDF into one JSON file.
So I wonder: is it possible to split a PDF into multiple images that are all given as input, with the output being a single JSON file? If yes, how?
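One possible approach, sketched under the assumptions that pdf2image renders the pages, that Idefics2 is the model, and that all pages fit in the model’s context window (the file path and prompt are placeholders):

```python
import torch
from pdf2image import convert_from_path
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto"
)

pages = convert_from_path("report.pdf", dpi=150)  # one PIL image per page

# Idefics2 accepts several images in one turn, so every page goes into a
# single user message and the model emits one JSON answer for the whole PDF.
content = [{"type": "image"} for _ in pages]
content.append({"type": "text",
                "text": "Return all paragraphs and their properties as one JSON object."})
messages = [{"role": "user", "content": content}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=pages, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Because every page arrives as its own image while the answer is generated in one pass, a paragraph that spans two pages is still visible to the model in full, which avoids the information loss from stitching the PDF into one image.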
Hi Bhuvan, were you able to succeed and get good inference results on this task? I am also researching the same problem and have not found a good way forward yet. Can you help me with this?