Fine tune LLMs on PDF Documents

Would you like to connect with me for a discussion on this, please?

Sure. This is my email ID (LinkedIn is below):
shekar.ramamurthy@gmail.com

Sure. Email in profile.

If you want to keep adding new data, I don’t think fine-tuning makes a lot of sense. You’re much better off with a RAG pipeline: Any knowledge you add is then instantly available in your system’s “knowledge”.

50 PDFs are plenty. Not for pretraining, of course. Not for fine-tuning either (unless style emulation is all you’re after). But if you want to make a bot that answers based on the ideas/facts/knowledge inside them, use RAG plus an additional augmentation step like the one I described in my previous post.
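(For anyone wanting a concrete starting point, here is a minimal sketch of the retrieval step: embed your chunks, pull the top-k matches for a question, and stuff them into the prompt. The embedding model and sample chunks are illustrative assumptions, not part of any specific pipeline discussed here.)

```python
# Minimal RAG retrieval sketch: embed chunks, retrieve top-k for a question,
# and build a prompt. Model name and chunks are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

# In practice these would be chunks extracted from your PDFs.
chunks = [
    "Our warranty covers manufacturing defects for 24 months.",
    "Returns are accepted within 30 days with proof of purchase.",
    "The device operates between 0 and 40 degrees Celsius.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(question, k=2):
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec           # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

question = "How long is the warranty?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # feed this prompt to whatever LLM you're using
```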

P.S. I also suspect that fine-tuning on knowledge often makes a model dumber. The reason is that you’re never really teaching the model “facts”. You’re teaching it specific sequences of tokens, i.e. specific ways to phrase answers. This means that you’re also teaching your model that every other way of answering the question is wrong.

Which is, of course, mistaken. Because the correct answer to a question can be phrased and formatted in many different forms. So when you fine-tune, you often inadvertently punish the model for perfectly good answers that you simply didn’t think of when building your training set. It’s a bit like sending Albert Einstein to the military, hitting him over the head for each word he utters until all he will do is say “Yes, sir!”. You achieved compliance, for sure – but at what cost?

1 Like

Thank you so much, @leobg, for clearing my doubt.

1 Like

Hi there, I’m also facing the exact same issue, would greatly appreciate it if you could share some reliable resources or guidance.

1 Like

If the problem is extracting data from the PDF itself in a structured manner, this might be helpful. Adobe has its own PDF Extract API that outputs JSON. In particular, the documentation mentions that the service extracts text and tables.

I’ve linked to the docs below:
https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/

Edit: adding a helpful Reddit post about using the Adobe PDF Extract API for PDF parsing, also linked below:
https://www.reddit.com/r/LocalLLaMA/comments/1anaooi/using_adobe_pdf_extract_api_for_pdf_parsing/
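If it helps, the Extract API returns a ZIP containing a structuredData.json file with an “elements” array. A rough sketch of pulling the text out might look like this (the field names are taken from the docs linked above, so double-check them against the current API version):

```python
# Rough sketch of reading the JSON that the Adobe PDF Extract API returns
# (a ZIP containing structuredData.json). Field names ("elements", "Text",
# "Path") follow the linked docs; verify against the current API version.
import json
import zipfile

with zipfile.ZipFile("extract_output.zip") as zf:      # placeholder path
    with zf.open("structuredData.json") as f:
        data = json.load(f)

texts = []
for element in data.get("elements", []):
    text = element.get("Text")
    if text:
        # "Path" encodes the element type, e.g. //Document/H1 or //Document/Table
        texts.append((element.get("Path", ""), text.strip()))

for path, text in texts[:20]:
    print(path, "->", text[:80])
```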

1 Like

Hi, I made an example RAG system like you described. My data is around 1000 pages and includes images, tables, and text. I switched my project from Llama 3.2 to DeepSeek. I want to discuss some topics about this project, but I can’t find your email on your profile. My Gmail address is demirbagalper1@gmail.com. Can you mail me? If you send me a mail, I can send you the full code and data (it is my personal project). Have a good day :slight_smile:

1 Like

Hi @imvbhuvan I am trying to implement a similar solution - i.e. train an existing medium-sized LLM on some documents internal to my organisation. I copied all the text in the Word/PDF documents into plain text files, loaded them as datasets with the Hugging Face load_dataset API, and trained the model. I tested the model before and after fine-tuning, and evidently, the model produces different text before and after fine-tuning - the latter being more in line with my documents. However, the generated text still isn’t quite useful.
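(Roughly, the loading step was of this shape; the file path is a placeholder, not my actual data:)

```python
# Sketch of loading plain-text files as a Hugging Face dataset.
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": "docs/*.txt"})
print(dataset["train"][0]["text"])  # each line of the files becomes one example
```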

Having read this thread, I am now confused as to whether this approach is correct or not.

Anyway, the first thing I need to accomplish is not just copying text out of the Word/PDF documents but extracting the meaningful text only, i.e. removing headers, footers, etc. So far, I have not been able to extract useful text.
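One heuristic I am considering is to drop lines that repeat on most pages, since headers and footers usually do. A rough sketch (the use of pypdf and the threshold are assumptions, not a tested solution):

```python
# Extract text per page, then drop lines that recur on most pages
# (a simple header/footer heuristic).
from collections import Counter
from pypdf import PdfReader

reader = PdfReader("internal_doc.pdf")            # placeholder path
pages = [(page.extract_text() or "").splitlines() for page in reader.pages]

line_counts = Counter(line.strip() for lines in pages for line in lines if line.strip())
threshold = max(2, int(0.6 * len(pages)))         # "appears on most pages"
boilerplate = {line for line, n in line_counts.items() if n >= threshold}

clean_pages = [
    "\n".join(line for line in lines if line.strip() not in boilerplate)
    for lines in pages
]
print(clean_pages[0])
```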

I believe this is what you had asked in the original question. If so, could you please help me achieve this?

1 Like

Thank you so much. I only read this today.

1 Like