We are currently seeking assistance in fine-tuning the Mistral model using approximately 48 PDF documents. Specifically, our challenge lies in training the model with PEFT and preparing the documents for optimal fine-tuning. We are having difficulty locating suitable resources for this task, and we are also uncertain about the proper procedures for preparing and storing the documents and feeding them to the model.
If anyone within the community has expertise in this area or can provide guidance on the aforementioned aspects, we would greatly appreciate your assistance. Your insights and recommendations would be invaluable to our project.
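For the PEFT part, here is a minimal sketch of a LoRA setup for Mistral; the checkpoint name, rank, and target modules are illustrative starting points, not a recommendation from this thread:

```python
# Hedged sketch: wrap a Mistral checkpoint with LoRA adapters via peft.
# All hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained
```

From there, the wrapped model can be passed to a regular Trainer/SFT loop on whatever text you extract from the PDFs.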
I assume you want to extract raw text from the PDFs? And in what form do you want the fine-tuning data to be?
Here’s a link to one Jupyter notebook from our pipeline for experiments in fine-tuning OpenAI models on PDFs and bibliographic ground-truth metadata; it uses PyMuPDF (imported under the name fitz) for text extraction.
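In case it helps, a minimal sketch of that extraction step with PyMuPDF, assuming the PDFs sit in a local pdfs/ directory:

```python
# Minimal sketch: extract raw text from each PDF page with PyMuPDF.
# The pdfs/ directory and the .txt output convention are assumptions.
import pathlib
import fitz  # PyMuPDF

for pdf_path in pathlib.Path("pdfs").glob("*.pdf"):
    doc = fitz.open(str(pdf_path))
    # Join the plain text of all pages into one string per document.
    text = "\n".join(page.get_text() for page in doc)
    pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```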
We aim to customize LLMs for a specific domain by fine-tuning them on approximately 50 books. This process should enhance the model’s understanding of the domain’s nuances and potentially expand its vocabulary. However, my team and I lack knowledge on how to effectively store and process these PDFs for the LLM, as existing online resources primarily discuss instruction fine-tuning and other methods. Any help and guidance will be deeply appreciated.
Hello there @imvbhuvan, were you able to fine-tune the model using PDFs (I assume unstructured data)? I am facing similar challenges. I have some PDFs and HTML website data, and they lack a formatted structure. But the goal is to fine-tune the model so it has the ability to understand the domain.
Hi. I’m also trying to fine-tune Mistral on some documents. Specifically, I have a text file extracted from a 1-5 page PDF, which serves as context; some questions on it; and another text file with a rather long, structured form of the answer (CSV output). How did you create the dataset?
One option is to fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a certain desired target. One example would be to take in a PDF and produce the corresponding JSON, containing desired keys and values from the PDF. Refer to the example notebook on fine-tuning PaliGemma or the example notebook on fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you can easily fine-tune it on PDFs that consist of multiple pages (see the sketch below).
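As a rough illustration of the multi-page case, here is a hedged sketch that converts each PDF page to an image with pdf2image and feeds all pages to Idefics2 in one pass; the file name and prompt are assumptions, and for actual fine-tuning you would pair such inputs with your ground-truth JSON target instead of generating:

```python
# Hedged sketch: multi-page PDF in, one JSON-like answer out, via Idefics2.
# Requires pdf2image (and poppler) plus transformers; "invoice.pdf" and the
# prompt are illustrative assumptions.
from pdf2image import convert_from_path
from transformers import AutoProcessor, AutoModelForVision2Seq

images = convert_from_path("invoice.pdf")  # one PIL image per page

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

# One user turn containing every page image, followed by the text prompt.
messages = [{
    "role": "user",
    "content": [{"type": "image"} for _ in images]
               + [{"type": "text", "text": "Extract the key fields as JSON."}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```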
Make your own dataset and train on it. I’m facing some issues with excessive replies: the model tends to give duplicate replies. But with a larger dataset we can ensure improved performance. Convert all the data to text using the -layout option of pdftotext, and then fine-tune with AutoTrain.
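For reference, a minimal sketch of that conversion step, assuming poppler-utils (which provides the pdftotext CLI) is installed and the PDFs sit in a pdfs/ directory:

```python
# Minimal sketch: batch-convert PDFs to layout-preserving text with
# poppler's pdftotext, producing one .txt per PDF for AutoTrain.
import pathlib
import subprocess

for pdf_path in pathlib.Path("pdfs").glob("*.pdf"):
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), str(txt_path)],
        check=True,
    )
```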
In general, when you talk about PDF to JSON, what we actually find in the notebook is image to JSON (it is as if the whole PDF were converted to a single image).
But in my case, I have a PDF with multiple pages. The PDF is divided into paragraphs, each with its own properties. The problem is that if I convert the PDF into page images, I risk losing information, since a paragraph might span two images.
My goal is to convert the entire PDF into one JSON file.
So, I wonder if it is possible to divide a PDF into multiple images that will be given as input, and the output will be a single JSON file. If yes, how?
Hi Bhuvan, were you able to succeed and get good inference results on this task? I am researching the same problem and have not found a good way forward as of yet. Can you help me with this?
Hi Sabber, I also face a similar problem: I have hundreds of PDFs on which I want to train/fine-tune an LLM. The thing is, when I train it on raw text (PDFs converted to text), the data is not well formatted, and hence the fine-tuned LLM does not work correctly. Also, the PDFs contain tables, and PyMuPDF, Tabula, etc. are not able to extract the tabular data from them. Please help if you know a solution.
Thanks. But can you please elaborate on how to use the PDF data? The problem is that my PDFs contain tables, mostly embedded as images. Even if an OCR engine extracts the tabular PDF data into text, how do you use it in training? Did you simply put the entire text files into training? Thanks in advance.
Sounds to me like there is a misunderstanding going on.
When you say “fine tune on PDF documents”, it sounds like you’re trying to train an LLM to extract text from an image-based PDF file. Basically perform OCR.
But it seems to me that you’re actually trying to have the LLM learn from the PDFs, the way a human would by reading them. So that, downstream, you can ask the LLM questions, and it’ll answer in the same way as a human would who has read all the PDFs.
Is that correct?
If so, a few pointers:
Training directly on the PDF content will only make your LLM generate text that looks similar to the content of those PDFs. If they are all from the same author, you’d be training an LLM to write like that author.
To train on the “knowledge” contained in the PDFs, you’d have to create a training set first. The training set needs to have a structure similar to the way you’ll want to use the LLM downstream.
With non-fiction books, much of what a human would call the “knowledge” is not actually contained in the book. Instead, the book often explains a rule (why/what/how/what if). And when a human later answers a question based on the book, the answer will be not from quoting parts of the book, but from applying the rules explained in the book. So to train an LLM on, say, books about copywriting, it would be a mistake to just train it on the content of that book. This would only create a machine that can generate text that sounds like a copywriting handbook - but not one that can produce copywriting. To do the latter, you need to add another layer in your {extraction → training → inference} pipeline.
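To make that extra layer concrete, here is a hedged sketch of what such a derived training set could look like; the field names and the example record are illustrative assumptions, with the key point being that each record mirrors the downstream question-answer usage rather than the raw book text:

```python
# Hedged sketch: write instruction-style records derived from the extracted
# PDF text to a JSONL file. The record shown is a made-up illustration.
import json

examples = [
    {
        "instruction": "What does the book recommend for headline length?",
        "input": "",  # optionally add a supporting excerpt as context
        "output": "Keep headlines short and lead with the benefit.",
    },
    # ... one record per question answerable from the extracted text
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

The questions and answers themselves can be drafted by hand or generated from the extracted text with a stronger model, then reviewed.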
Thanks for this meaningful reply. Still, my doubt is: would continued pretraining be required, or fine-tuning with the PDF training data? I am preparing raw text, but also QA pairs from those texts, so that the model knows how the questions will come and how it has to respond. So, do I need to fine-tune or continue pretraining? Also, if I need to pretrain, is the data from just 50 PDFs enough, or do I need a large volume of PDFs to generalize the knowledge? Thanks in advance for any insight you could provide.