Would you like to connect with me for a discussion on this?
Sure. This is my email id (and linkedin below)
shekar.ramamurthy@gmail.com
Sure. Email in profile.
If you want to keep adding new data, I don't think fine-tuning makes a lot of sense. You're much better off with a RAG pipeline: any knowledge you add is then instantly available in your system's "knowledge".
50 PDFs are plenty. Not for pretraining, of course. Not for fine-tuning either (unless style emulation is all you're after). But if you want to make a bot that answers based on the ideas/facts/knowledge inside them, use RAG plus an additional augmentation step like the one I described in my previous post.
P.S. I also suspect that fine-tuning on knowledge often makes a model dumber. The reason is that you're never really teaching the model "facts". You're teaching it specific sequences of tokens, i.e. specific ways to phrase answers. This also means you're teaching your model that every other way of answering the question is wrong.
Which is, of course, mistaken. Because the correct answer to a question can be phrased and formatted in many different ways. So when you fine-tune, you often inadvertently punish the model for perfectly good answers that you simply didn't think of when building your training set. It's a bit like sending Albert Einstein to the military and hitting him over the head for each word he utters until all he will do is say "Yes, sir!". You achieved compliance, for sure, but at what cost?
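Here's a minimal sketch of what such a RAG pipeline can look like, just to make it concrete. It assumes the sentence-transformers package and an in-memory index; the embedding model, chunking, and top-k are illustrative choices, not a prescription.

# Minimal RAG sketch (illustrative; model name and k are assumptions).
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedding model

def build_index(chunks):
    # Embed every chunk once; normalized vectors make dot product = cosine similarity.
    return np.asarray(embedder.encode(chunks, normalize_embeddings=True))

def retrieve(question, chunks, index, k=4):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(-(index @ q))[:k]
    return [chunks[i] for i in top]

def make_prompt(question, passages):
    context = "\n\n".join(passages)
    return ("Answer using ONLY the context below. If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# chunks = [...]  # e.g. the text chunks extracted from your PDFs
# index = build_index(chunks)
# prompt = make_prompt(q, retrieve(q, chunks, index))  # feed to whatever LLM you use

The point is in that last comment: any new chunk you embed and append to the index is available immediately, with no retraining.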
Thank you so much, @leobg, for clearing my doubt.
Hi there, I'm also facing the exact same issue. I would greatly appreciate it if you could share some reliable resources or guidance.
If the problem is extracting data from the PDF itself in a structured manner, this might be helpful: Adobe has its own PDF Extract API that outputs JSON. In particular, the documentation mentions that the service extracts text and tables.
I've linked to the docs below:
https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/
Edit: Adding a helpful Reddit post about using the Adobe PDF Extract API for PDF parsing, also linked below:
https://www.reddit.com/r/LocalLLaMA/comments/1anaooi/using_adobe_pdf_extract_api_for_pdf_parsing/
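In case it helps, here is a rough sketch of flattening the Extract API output into plain text. I'm assuming the structuredData.json layout (an "elements" list whose items carry "Path" and "Text" fields), so double-check the exact schema against the docs before relying on it.

# Sketch: flatten Adobe PDF Extract API output (structuredData.json) into plain text.
# Field names ("elements", "Path", "Text") are assumptions based on the docs; verify them.
import json

def extract_text(structured_json_path):
    with open(structured_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    lines = []
    for el in data.get("elements", []):
        text = el.get("Text", "").strip()
        if not text:
            continue  # figures and table cells may be exported separately
        prefix = "# " if "/H1" in el.get("Path", "") else ""  # crude heading marker
        lines.append(prefix + text)
    return "\n".join(lines)

# print(extract_text("structuredData.json"))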
Hi, I made an example RAG system like you described. My data is around 1000 pages and includes images, tables, and text. I switched my project from Llama 3.2 to DeepSeek. I would like to discuss some topics about this project, but I can't find your email on your profile. My Gmail address is demirbagalper1@gmail.com. Can you email me? If you send me a mail, I can send you the full code and data (it is my personal project). Have a good day!
Hi @imvbhuvan, I am trying to implement a similar solution, i.e. train an existing medium-sized LLM on some documents internal to my organisation. I copied all the text in the Word/PDF documents into a plain text file, loaded it as a dataset with the Hugging Face load_dataset API, and trained the model. I tested the model before and after fine-tuning, and evidently the model produces different text before and after fine-tuning, the latter being more in line with my documents. However, the generated text still isn't quite useful.
Having read this thread, I am now confused as to whether this approach is correct or not.
Anyway, the first thing I need to accomplish is not just copying text from Word/PDF documents but extracting only the meaningful text, i.e. removing headers, etc. So far, I have not been able to extract useful text.
I believe this is what you had asked about in the original question. If so, can you please help me achieve this?
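One heuristic I plan to try is dropping lines that repeat on many pages (running headers, footers, page numbers). A rough sketch, assuming per-page text is already available (e.g. from PyMuPDF, as used elsewhere in this thread); the thresholds are guesses to tune per document:

# Heuristic cleanup: drop lines that recur on many pages (likely headers/footers).
import re
from collections import Counter

def clean_pages(pages, min_repeat_ratio=0.5):
    line_counts = Counter()
    per_page_lines = []
    for page in pages:
        lines = [l.strip() for l in page.splitlines() if l.strip()]
        per_page_lines.append(lines)
        line_counts.update(set(lines))  # count each distinct line once per page

    threshold = max(2, int(len(pages) * min_repeat_ratio))
    cleaned = []
    for lines in per_page_lines:
        kept = [l for l in lines
                if line_counts[l] < threshold                      # repeated header/footer
                and not re.fullmatch(r"(page\s*)?\d+", l, re.I)]   # bare page numbers
        cleaned.append("\n".join(kept))
    return cleaned

# with fitz.open("doc.pdf") as doc:
#     cleaned = clean_pages([page.get_text("text") for page in doc])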
Thank you so much. I only read this today.
Hi @wanderingdeveloper71 @sabber
Seeking Advice: Fine-Tuning LLMs for Complex Document-Based QA Tasks
I'm working on fine-tuning a language model using a dataset derived from unstructured documents (e.g., technical guides or regulatory manuals). My current approach involves extracting paragraphs from the source PDFs and prompting an LLM to generate multiple Q&A pairs per section. While this method is scalable and has helped me build a sizable dataset (e.g., 500,000+ Q&A pairs), it has a major limitation:
The generated answers are constrained to the context of the input section.
This becomes problematic when the correct answer to a question requires referencing multiple sections or tables across the document. For example, answering a question about the conditional usage of a specific variable might require synthesizing information from several chapters, appendices, and rule tables. The model, trained on isolated Q&A pairs, struggles to generalize or reason across sections.
Context: the PDF contains information about rules and regulations that need to be followed while creating a dataset for a public use case. Basically, think of it as a compliance-related module.
My Goal
To fine-tune a model that can:
1. Understand and answer general questions about the document in the context of violation, validation, and compliance with the rules described in the training PDF.
2. Reference and synthesize information from multiple parts of the document, because in real life a human would pull from 5 tables and 10 different pages to curate one answer.
3. Provide accurate and contextually rich responses, similar to how a human expert would.
Challenges
1. Context window limitations during training data generation.
2. Lack of cross-sectional reasoning in the training samples.
3. Risk of overfitting to shallow Q&A patterns.
What I'm Exploring
1. Are there alternative data creation strategies, apart from Q&A pairs, that can better capture cross-sectional dependencies?
2. Should I consider multi-hop QA generation or document-level summarization as part of the dataset? Even with multi-hop generation or document-level summaries, the chances are very slim that the LLM will be handed the most relevant chunks of data, the ones that would actually make a more meaningful Q&A pair (one rough way of grouping related chunks is sketched at the end of this post).
P.S.: I know RAG (Retrieval-Augmented Generation) is a more suitable approach for this use case, but I am exploring the problem above regardless!
I'd love to hear from others who've tackled similar problems. How did you structure your dataset to enable deep document understanding? Any tips or frameworks you recommend?
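As referenced above, here is a rough sketch of one way to build cross-section training samples: embed all chunks, pair each chunk with its nearest neighbours from other sections, and only then ask an LLM to write Q&A pairs over the combined context. The embedding model and neighbour count are assumptions, and generate_qa is a placeholder for whatever LLM call already produces my Q&A pairs.

# Sketch: assemble cross-section contexts for multi-hop Q&A generation.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def multi_hop_contexts(chunks, neighbours=3):
    # chunks: list of {"section": "4.2", "text": "..."} dicts (assumed format).
    vecs = np.asarray(embedder.encode([c["text"] for c in chunks], normalize_embeddings=True))
    contexts = []
    for i, chunk in enumerate(chunks):
        order = np.argsort(-(vecs @ vecs[i]))
        # keep the closest chunks that come from *other* sections
        picked = [j for j in order if chunks[j]["section"] != chunk["section"]][:neighbours]
        contexts.append("\n\n".join([chunk["text"]] + [chunks[j]["text"] for j in picked]))
    return contexts

# for ctx in multi_hop_contexts(chunks):
#     qa_pairs = generate_qa(ctx)  # placeholder: your existing Q&A generation prompt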
You have this model that's fine-tuned to extract metadata from PDFs. What do you train the LLM on, i.e. what is the data for the fine-tuning? Some synthetic data?
Also, I want to try out this model for creating data from some PDFs, so PDF in, dataset created. Does your model support this? I sent a LinkedIn message.
hi @nielsr
All your notebooks are showing as invalid:
Invalid Notebook
There was an error rendering your Notebook: the 'state' key is missing from 'metadata.widgets'. Add 'state' to each, or remove 'metadata.widgets'.
Using nbformat v5.10.4 and nbconvert v7.16.6
Could you check this?
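In the meantime, a workaround on the reader's side is to strip metadata.widgets from the notebook JSON, which is one of the two fixes the error message suggests. A minimal sketch, operating on the raw .ipynb file:

# Workaround sketch: remove "metadata.widgets" so renderers stop complaining
# about the missing "state" key.
import json

def strip_widget_metadata(path):
    with open(path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    nb.get("metadata", {}).pop("widgets", None)        # notebook-level widget state
    for cell in nb.get("cells", []):
        cell.get("metadata", {}).pop("widgets", None)  # cell-level widget state
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1)

# strip_widget_metadata("notebook.ipynb")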
# File: tools/pdf_peft_runner.py
"""
Interactive runner: PDF → chunks → SFT → QLoRA fine-tune → optional merge.
Now with:
- OCR fallback for scanned PDFs (Tesseract + pytesseract + Pillow)
- Project scaffolder: sample PDFs + Makefile
Usage:
    python tools/pdf_peft_runner.py            # interactive wizard
    RUN_ALL=1 python tools/pdf_peft_runner.py  # non-interactive run-all with saved/default config
    # After scaffolding:
    make run  # runs RUN_ALL
Install (ingest + OCR only):
    pip install pymupdf pillow pytesseract
Install (training/merge too):
    pip install torch transformers datasets peft bitsandbytes accelerate
Requires Tesseract binary for OCR (e.g., `sudo apt-get install tesseract-ocr`).
"""
from __future__ import annotations

import os
import re
import io
import json
import math
import glob
import hashlib
from dataclasses import dataclass, asdict
from typing import Iterable, List, Dict, Any, Optional

# Lazy imports for light startup
try:
    import fitz  # PyMuPDF
except Exception:
    fitz = None  # surfaced when needed


# ----------------------------
# Config
# ----------------------------
@dataclass
class Config:
    # Paths
    pdf_dir: str = "./pdfs"
    artifacts_dir: str = "./artifacts"
    chunks_jsonl: str = "./artifacts/chunks.jsonl"
    sft_jsonl: str = "./artifacts/sft.jsonl"
    lora_output_dir: str = "./artifacts/mistral-lora"
    merged_output_dir: str = "./artifacts/mistral-merged"
    # Model
    base_model: str = "mistralai/Mistral-7B-Instruct-v0.2"
    # Ingest
    min_chars: int = 200
    target_tokens: int = 900
    overlap_tokens: int = 120
    # SFT
    task: str = "summarize"  # or "qa"
    # Train
    batch: int = 2
    accum: int = 8
    epochs: int = 2
    lr: float = 2e-4
    block: int = 2048
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    # OCR
    ocr_enabled: bool = True
    ocr_lang: str = "eng"
    ocr_dpi: int = 300  # higher → better OCR but slower

    @staticmethod
    def load_or_default(path: str) -> "Config":
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
                return Config(**data)
        return Config()

    def save(self, path: str) -> None:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)


# ----------------------------
# Utils
# ----------------------------
def ensure_dir(path: str) -> None:
    os.makedirs(path, exist_ok=True)


def ensure_parent(path: str) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)


def file_md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def norm_text(s: str) -> str:
    # Why: PDFs often contain hard breaks/hyphenation → hurts downstream quality.
    s = s.replace("\r", "\n")
    s = re.sub(r"(\w)-\n(\w)", r"\1\2", s)
    s = re.sub(r"[ \t]+\n", "\n", s)
    s = re.sub(r"\n{2,}", "\n\n", s)
    s = re.sub(r"[ \t]{2,}", " ", s)
    return s.strip()


def approx_token_count(s: str) -> int:
    return max(1, math.ceil(len(s) / 4))


def sliding_chunks(text: str, target_tokens: int = 900, overlap_tokens: int = 120) -> List[str]:
    if not text:
        return []
    ratio = 4
    target_chars = target_tokens * ratio
    overlap_chars = overlap_tokens * ratio
    chunks = []
    i, n = 0, len(text)
    while i < n:
        j = min(n, i + target_chars)
        window = text[i:j]
        m = re.search(r"(?s).*[\.!?]\s+", window)
        if m and (i + m.end()) > i + target_chars * 0.6:
            j = i + m.end()
        chunk = text[i:j].strip()
        if chunk:
            chunks.append(chunk)
        if j >= n:
            break
        i = max(0, j - overlap_chars)
    return chunks


def jsonl_write(path: str, rows: Iterable[Dict[str, Any]]) -> None:
    ensure_parent(path)
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


def jsonl_stream(path: str) -> Iterable[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def yesno(msg: str, default_yes: bool = True) -> bool:
    default = "Y/n" if default_yes else "y/N"
    ans = input(f"{msg} [{default}]: ").strip().lower()
    if not ans:
        return default_yes
    return ans.startswith("y")


def prompt(msg: str, default: Optional[str] = None) -> str:
    tip = f" [{default}]" if default is not None else ""
    val = input(f"{msg}{tip}: ").strip()
    return val or (default or "")


def prompt_int(msg: str, default: int) -> int:
    val = prompt(msg, str(default))
    try:
        return int(val)
    except Exception:
        return default


def prompt_float(msg: str, default: float) -> float:
    val = prompt(msg, str(default))
    try:
        return float(val)
    except Exception:
        return default


# ----------------------------
# OCR helpers
# ----------------------------
def have_ocr() -> bool:
    try:
        import pytesseract  # noqa: F401
        from pytesseract import get_tesseract_version
        _ = get_tesseract_version()  # ensures binary is available
        return True
    except Exception:
        return False


def ocr_page_with_tesseract(page: "fitz.Page", dpi: int, lang: str) -> str:
    # Why: Fallback when text extraction returns too little (image-only PDFs).
    try:
        from PIL import Image
        import pytesseract
    except Exception:
        return ""
    zoom = dpi / 72.0
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat)
    png_bytes = pix.tobytes("png")
    img = Image.open(io.BytesIO(png_bytes))
    try:
        text = pytesseract.image_to_string(img, lang=lang, config="--psm 6")
    except Exception:
        text = ""
    return norm_text(text)


# ----------------------------
# Steps
# ----------------------------
def extract_and_chunk(cfg: Config) -> int:
    if fitz is None:
        raise RuntimeError("PyMuPDF not installed. Install: pip install pymupdf")
    pdfs = sorted(glob.glob(os.path.join(cfg.pdf_dir, "**", "*.pdf"), recursive=True))
    if not pdfs:
        raise RuntimeError(f"No PDFs found in {cfg.pdf_dir}")
    use_ocr = cfg.ocr_enabled and have_ocr()
    if cfg.ocr_enabled and not use_ocr:
        print("[warn] OCR enabled but pytesseract or Tesseract binary not found; continuing without OCR.")
    rows: List[Dict[str, Any]] = []
    for pdf_path in pdfs:
        try:
            doc = fitz.open(pdf_path)
        except Exception as e:
            print(f"[warn] cannot open {pdf_path}: {e}")
            continue
        base = os.path.basename(pdf_path)
        signature = file_md5(pdf_path)
        for pi in range(len(doc)):
            try:
                page = doc[pi]
                text = page.get_text("text")
            except Exception:
                text = ""
            text = norm_text(text)
            if (not text or len(text) < cfg.min_chars) and use_ocr:
                ocr_text = ocr_page_with_tesseract(page, cfg.ocr_dpi, cfg.ocr_lang)
                if len(ocr_text) > len(text):
                    text = ocr_text
            if not text or len(text) < cfg.min_chars:
                continue
            chunks = sliding_chunks(text, cfg.target_tokens, cfg.overlap_tokens)
            for ci, chunk in enumerate(chunks):
                if len(chunk) < cfg.min_chars:
                    continue
                rows.append({
                    "doc_file": base,
                    "doc_md5": signature,
                    "page": pi + 1,
                    "chunk_id": f"{signature[:8]}_{pi+1}_{ci+1}",
                    "text": chunk,
                })
        doc.close()
    if not rows:
        raise RuntimeError("No extractable text. If PDFs are scanned and OCR is off/missing, enable OCR and install Tesseract.")
    jsonl_write(cfg.chunks_jsonl, rows)
    print(f"[ingest] wrote {len(rows)} chunks → {cfg.chunks_jsonl}")
    return len(rows)


def build_sft_dataset(cfg: Config) -> int:
    rows_out = []
    for r in jsonl_stream(cfg.chunks_jsonl):
        context = r["text"].strip()
        if cfg.task == "summarize":
            instr = (
                "You are a helpful assistant. Using ONLY the context below, write a concise summary "
                "capturing key points, definitions, numbers, and any procedures.\n\n"
                f"Context:\n{context}\n"
            )
            target = context  # self-supervised target
        else:
            instr = (
                "Answer the question using ONLY the context below.\n\n"
                f"Context:\n{context}\n\n"
                "Question: What does this section explain?"
            )
            target = context
        prompt_text = f"<s>[INST] {instr} [/INST] {target} </s>"
        rows_out.append({"text": prompt_text})
    if not rows_out:
        raise RuntimeError("No SFT rows produced. Did you run ingest?")
    jsonl_write(cfg.sft_jsonl, rows_out)
    print(f"[sft] wrote {len(rows_out)} samples → {cfg.sft_jsonl}")
    return len(rows_out)


def train_lora(cfg: Config) -> None:
    try:
        from datasets import load_dataset
        from transformers import (AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
                                  DataCollatorForLanguageModeling, Trainer, TrainingArguments)
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    except Exception as e:
        raise RuntimeError(
            "Training deps missing. Install: pip install torch transformers datasets peft bitsandbytes accelerate"
        ) from e
    ensure_dir(cfg.lora_output_dir)
    ds = load_dataset("json", data_files=cfg.sft_jsonl, split="train")
    tok = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=True)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    tok.padding_side = "right"
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        cfg.base_model,
        quantization_config=bnb,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(
        r=cfg.lora_r,
        lora_alpha=cfg.lora_alpha,
        lora_dropout=cfg.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    )
    model = get_peft_model(model, lora)

    def tok_fn(batch):
        return tok(batch["text"], truncation=True, max_length=cfg.block)

    ds_tok = ds.map(tok_fn, batched=True, remove_columns=[c for c in ds.column_names if c != "text"])
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
    steps_per_epoch = max(1, len(ds_tok) // max(1, (cfg.batch * cfg.accum)))
    save_steps = max(50, steps_per_epoch)
    args = TrainingArguments(
        output_dir=cfg.lora_output_dir,
        per_device_train_batch_size=cfg.batch,
        gradient_accumulation_steps=cfg.accum,
        learning_rate=cfg.lr,
        num_train_epochs=cfg.epochs,
        logging_steps=10,
        save_steps=save_steps,
        save_total_limit=2,
        bf16=True,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        weight_decay=0.0,
        gradient_checkpointing=True,
        report_to="none",
        optim="paged_adamw_8bit",
        max_grad_norm=1.0,
    )
    from transformers.trainer_utils import get_last_checkpoint
    last = get_last_checkpoint(cfg.lora_output_dir) if os.path.isdir(cfg.lora_output_dir) else None
    if last:
        print(f"[train] resuming from {last}")
    trainer = Trainer(model=model, args=args, train_dataset=ds_tok, data_collator=collator)
    trainer.train(resume_from_checkpoint=last)
    trainer.save_model(cfg.lora_output_dir)
    tok.save_pretrained(cfg.lora_output_dir)
    print(f"[train] adapter saved → {cfg.lora_output_dir}")


def merge_lora(cfg: Config) -> None:
    try:
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel
    except Exception as e:
        raise RuntimeError("Install training deps to merge: transformers peft torch") from e
    ensure_dir(cfg.merged_output_dir)
    tok = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=True)
    base = AutoModelForCausalLM.from_pretrained(cfg.base_model, torch_dtype="auto", device_map="auto")
    peft_model = PeftModel.from_pretrained(base, cfg.lora_output_dir)
    merged = peft_model.merge_and_unload()
    merged.save_pretrained(cfg.merged_output_dir)
    tok.save_pretrained(cfg.merged_output_dir)
    print(f"[merge] merged model saved → {cfg.merged_output_dir}")


# ----------------------------
# Scaffolder
# ----------------------------
def scaffold_sample_project(cfg: Config) -> None:
    """Create pdfs/, artifacts/, Makefile, requirements.txt and two PDFs (text + scanned)."""
    ensure_dir(cfg.pdf_dir)
    ensure_dir(cfg.artifacts_dir)
    # sample_text.pdf: selectable text
    if fitz is None:
        raise RuntimeError("PyMuPDF is required to scaffold. Install: pip install pymupdf")
    text_pdf_path = os.path.join(cfg.pdf_dir, "sample_text.pdf")
    if not os.path.exists(text_pdf_path):
        doc = fitz.open()
        page = doc.new_page()
        sample_text = (
            "Sample Project - Text PDF\n\n"
            "This PDF contains real text (not an image).\n"
            "Use it to verify non-OCR extraction and chunking.\n\n"
            "Key Points:\n"
            "- Fine-tuning with QLoRA on Mistral.\n"
            "- Chunk size ≈ 900 tokens with overlap.\n"
            "- JSONL output used for SFT training.\n"
        )
        page.insert_text(fitz.Point(72, 72), sample_text, fontsize=12)
        doc.save(text_pdf_path)
        doc.close()
        print(f"[scaffold] wrote {text_pdf_path}")
    # sample_scanned.pdf: image-only page → forces OCR
    scanned_pdf_path = os.path.join(cfg.pdf_dir, "sample_scanned.pdf")
    if not os.path.exists(scanned_pdf_path):
        try:
            from PIL import Image, ImageDraw, ImageFont
        except Exception as e:
            raise RuntimeError("Pillow is required to scaffold scanned PDF. Install: pip install pillow") from e
        # build image
        img = Image.new("RGB", (1654, 2339), "white")  # ~ A4 @ 150dpi
        draw = ImageDraw.Draw(img)
        text = (
            "Sample Project - Scanned PDF (Image Only)\n\n"
            "This page is rendered as an image to simulate a scanned document.\n"
            "If OCR is working, the pipeline should extract this text from the image."
        )
        try:
            font = ImageFont.truetype("DejaVuSans.ttf", 28)
        except Exception:
            font = ImageFont.load_default()
        draw.multiline_text((80, 120), text, fill="black", font=font, spacing=8)
        # save to PDF using PyMuPDF
        img_bytes = io.BytesIO()
        img.save(img_bytes, format="PNG")
        img_bytes.seek(0)
        doc2 = fitz.open()
        page2 = doc2.new_page(width=595, height=842)  # A4 pt
        rect = fitz.Rect(40, 60, 555, 782)
        page2.insert_image(rect, stream=img_bytes.getvalue(), keep_proportion=True)
        doc2.save(scanned_pdf_path)
        doc2.close()
        print(f"[scaffold] wrote {scanned_pdf_path}")
    # Makefile
    mk_path = "Makefile"
    if not os.path.exists(mk_path):
        makefile = f"""# Auto-generated by pdf_peft_runner.py
PY ?= python
RUN_ALL ?= 1
run:
\t@RUN_ALL=$(RUN_ALL) $(PY) tools/pdf_peft_runner.py
wizard:
\t@$(PY) tools/pdf_peft_runner.py
.PHONY: run wizard
"""
        with open(mk_path, "w", encoding="utf-8") as f:
            f.write(makefile)
        print(f"[scaffold] wrote {mk_path}")
    # requirements.txt (optional, helpful)
    req_path = "requirements.txt"
    if not os.path.exists(req_path):
        req = "\n".join([
            "pymupdf>=1.23",
            "pillow>=10.0",
            "pytesseract>=0.3.10",
            "torch>=2.2",
            "transformers>=4.42",
            "datasets>=2.19",
            "peft>=0.11",
            "bitsandbytes>=0.43.1",
            "accelerate>=0.33",
        ]) + "\n"
        with open(req_path, "w", encoding="utf-8") as f:
            f.write(req)
        print(f"[scaffold] wrote {req_path}")
    print("[scaffold] done. Tip: run `make run` or `make wizard`.")


# ----------------------------
# Wizard
# ----------------------------
def show_config(cfg: Config) -> None:
    print("\nCurrent configuration:")
    for k, v in asdict(cfg).items():
        print(f" {k}: {v}")
    print("")


def edit_config(cfg: Config) -> Config:
    print("Edit config (Enter keeps defaults)")
    cfg.pdf_dir = prompt("PDF directory", cfg.pdf_dir)
    cfg.artifacts_dir = prompt("Artifacts directory", cfg.artifacts_dir)
    cfg.chunks_jsonl = prompt("Chunks JSONL", cfg.chunks_jsonl)
    cfg.sft_jsonl = prompt("SFT JSONL", cfg.sft_jsonl)
    cfg.lora_output_dir = prompt("LoRA output dir", cfg.lora_output_dir)
    cfg.merged_output_dir = prompt("Merged model dir", cfg.merged_output_dir)
    cfg.base_model = prompt("Base model repo", cfg.base_model)
    cfg.min_chars = prompt_int("Min chars per chunk", cfg.min_chars)
    cfg.target_tokens = prompt_int("Target tokens per chunk", cfg.target_tokens)
    cfg.overlap_tokens = prompt_int("Overlap tokens", cfg.overlap_tokens)
    task = prompt("SFT task (summarize|qa)", cfg.task)
    cfg.task = task if task in {"summarize", "qa"} else cfg.task
    cfg.batch = prompt_int("Per-device batch", cfg.batch)
    cfg.accum = prompt_int("Grad accumulation", cfg.accum)
    cfg.epochs = prompt_int("Epochs", cfg.epochs)
    cfg.lr = prompt_float("Learning rate", cfg.lr)
    cfg.block = prompt_int("Max sequence length", cfg.block)
    cfg.lora_r = prompt_int("LoRA r", cfg.lora_r)
    cfg.lora_alpha = prompt_int("LoRA alpha", cfg.lora_alpha)
    cfg.lora_dropout = float(prompt("LoRA dropout", str(cfg.lora_dropout)) or cfg.lora_dropout)
    cfg.ocr_enabled = yesno("Enable OCR fallback for scanned PDFs?", cfg.ocr_enabled)
    cfg.ocr_lang = prompt("OCR language (Tesseract lang code)", cfg.ocr_lang)
    cfg.ocr_dpi = prompt_int("OCR rasterization DPI", cfg.ocr_dpi)
    return cfg


def run_all(cfg: Config, save_cfg_path: str) -> None:
    ensure_dir(cfg.artifacts_dir)
    cfg.save(save_cfg_path)
    print("== Step 1/3: Ingest PDFs ==")
    extract_and_chunk(cfg)
    print("== Step 2/3: Build SFT dataset ==")
    build_sft_dataset(cfg)
    print("== Step 3/3: Train QLoRA ==")
    train_lora(cfg)
    if yesno("Merge adapter into base for a single model?", default_yes=False):
        merge_lora(cfg)


def menu(cfg_path: str) -> None:
    cfg = Config.load_or_default(cfg_path)
    ensure_dir(os.path.dirname(cfg_path))
    while True:
        show_config(cfg)
        print("Choose an action:")
        print(" 1) Run ALL (ingest → sft → train)")
        print(" 2) Ingest PDFs only")
        print(" 3) Build SFT only")
        print(" 4) Train QLoRA only")
        print(" 5) Merge LoRA into base")
        print(" 6) Edit & Save config")
        print(" 7) Scaffold sample project (PDFs + Makefile)")
        print(" 8) Quit")
        choice = input("Select [1]: ").strip() or "1"
        try:
            if choice == "1":
                run_all(cfg, cfg_path)
            elif choice == "2":
                extract_and_chunk(cfg)
            elif choice == "3":
                build_sft_dataset(cfg)
            elif choice == "4":
                train_lora(cfg)
            elif choice == "5":
                merge_lora(cfg)
            elif choice == "6":
                cfg = edit_config(cfg)
                cfg.save(cfg_path)
                print("[ok] config saved.")
            elif choice == "7":
                scaffold_sample_project(cfg)
            elif choice == "8":
                return
            else:
                print("Invalid choice.")
        except Exception as e:
            print(f"[error] {e}")


def main():
    cfg_path = "./artifacts/config.json"
    if os.environ.get("RUN_ALL"):
        cfg = Config.load_or_default(cfg_path)
        run_all(cfg, cfg_path)
    else:
        menu(cfg_path)


if __name__ == "__main__":
    main()
Reply generated by TD Ai
For fine-tuning we use metadata from various publications; see NatLibFi/FinGreyLit's README.md:
This repository contains a data set of curated Dublin Core style ground truth metadata from a selection of Finnish "grey literature" publications, along with links to the PDF publications. The dataset is mainly intended to enable and facilitate the development of automated methods for metadata extraction from PDF files, including but not limited to the use of large language models (LLMs).
The publications have been sampled from various DSpace based open repository systems administered by the National Library of Finland. The dataset is trilingual, containing publications in Finnish, Swedish and English language.
All the publication PDF files are openly accessible from the original DSpace systems. Due to copyright concerns, this repository contains only the curated metadata and links to the original PDF files. The repository contains scripts for downloading the PDF publications from the original repositories and extracting the full text.
The preprocessing pipeline scripts/notebooks are in the conversion/ directory of the repository.
The models created by fine-tuning with these data are intended for extracting metadata like this from PDFs, so as you can see they are for library use, which might not be what you want. Anyway, the newest model is here: NatLibFi/gemma-3-4b-it-GreyLitLM-GGUF · Hugging Face
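To give a rough idea of the target output, a record looks something like the following. This is a made-up illustration of a Dublin Core style record, not an actual entry from the dataset; the exact field set and naming are defined in the FinGreyLit README.

# Illustrative only: a made-up Dublin Core style metadata record.
example_record = {
    "title": "An Example Report on Something",
    "creator": ["Doe, Jane"],
    "publisher": ["Example University"],
    "year": "2021",
    "language": "eng",
    "identifier": "urn:nbn:fi:example-123456",  # hypothetical identifier
}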