Hi Everyone,
Most of the resources on fine-tuning use a pre-existing dataset.
Given that I have a single PDF, how do I generate a dataset that can be used for fine-tuning, so that I can use the fine-tuned model to answer user queries in the context of the document?
I understand that RAG is the recommended mechanism for this use case, but I want to try fine-tuning anyway; I am just stuck on how to generate the dataset.
I have seen suggestions for a mechanism where people use an LLM to generate questions and prepare the dataset. But how can that cover all possible questions that could be asked? Also, I am not finding any concrete implementation.
Please point me to any GitHub repos that already do this on a sample PDF.
I know it’s not the exact answer, but maybe the logic in this space will help.
There is no built-in PDF conversion in the standard HF tooling, so the PDF has to be converted to text yourself and then fed to the LLM.
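Something like this could work as a rough starting point (an untested sketch; `pypdf` and the instruct model are just assumptions on my part, swap in whatever you actually use):

```python
# Sketch: extract PDF text, then ask an LLM to draft QA pairs from it.
# Assumes `pypdf` and `transformers` are installed; the model name is
# only an example (a 7B instruct model needs a decent GPU).
from pypdf import PdfReader
from transformers import pipeline

reader = PdfReader("sample.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")

# Chunk the document so each prompt fits in the context window.
chunk_size = 2000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

qa_pairs = []
for chunk in chunks:
    prompt = ("Read the passage below and write 3 question/answer pairs "
              "it can answer, one per line as `Q: ... A: ...`.\n\n" + chunk)
    out = generator(prompt, max_new_tokens=256, return_full_text=False)
    qa_pairs.append(out[0]["generated_text"])
```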
I think someone made a space for converting PDFs to datasets, but I can’t find it.
Maybe it’s somewhere in this.
Hi @sujitrect ,
In any case (RAG or fine-tuning) you first have to extract the information from the PDF. Both LangChain and LlamaIndex offer the functionality you need. Hope this helps with establishing your dataset.
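For example, with LangChain the extraction step could look roughly like this (just a sketch, assuming `langchain-community` and `pypdf` are installed; LlamaIndex's SimpleDirectoryReader would be the rough equivalent):

```python
# Load a PDF and split it into chunks usable for fine-tuning or RAG.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("sample.pdf").load()  # one Document per page
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
print(len(chunks), chunks[0].page_content[:200])
```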
From there on, everything depends on what you want to fine-tune the model for. For QA I would definitely start with RAG; it can then serve as a baseline to compare your fine-tuning efforts against.
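A bare-bones retrieval step for that baseline might look like this (again only a sketch; the embedding model and the query are placeholders):

```python
# Embed document chunks and retrieve the most relevant ones for a query.
# Assumes `sentence-transformers` is installed.
from sentence_transformers import SentenceTransformer, util

texts = ["...your document chunks from the loader step..."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = embedder.encode(texts, convert_to_tensor=True)

query = "What does the document say about X?"
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Pass the retrieved chunks plus the query to any instruct LLM.
context = "\n\n".join(texts[h["corpus_id"]] for h in hits)
```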
Best,
M
@sujitrect - Thank you for asking this question. I was asking my friends the same question. I’m following this thread.
@mikehemberger - I am a newbie here. From what I understand, HF models are meant for one particular task type (QA, text generation, seq2seq, classification, etc.). We would like to fine-tune a pretrained model (like Mistral or Llama) so it gains knowledge of the numerous domain-specific documents we have, and then use the fine-tuned model mostly for question answering or summarization. For QA, we can come up with thousands of question/answer pairs to train the model, but they won’t be sufficient to cover all the knowledge the documents contain. So the question is: how do we feed all the documents in for fine-tuning instead of training on just thousands of question/answer pairs? We could use RAG, but what if we want to use the fine-tuning approach? Any insights will greatly help. Thank you!
Hi @rishisuresh ,
I’m definitely not an expert on this, but it sounds like you’re looking for an unsupervised approach to fine-tuning (since you want to cover most-to-all of the knowledge in your dataset, but, I assume, the documents vary in structure and information content, which makes manual information extraction infeasible).
As far as I understand the current LLM landscape, that would mean either continued "pretraining" with plain next-word prediction (most likely also not feasible at full scale) or using PEFT.
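For reference, the PEFT route could be sketched like this: LoRA adapters on top of a causal LM, trained with next-token prediction on the raw document text (model name and hyperparameters are placeholders):

```python
# Untested sketch: continued pretraining of a causal LM on raw document
# text via LoRA. Assumes `peft`, `datasets` and `transformers` installed.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         task_type="CAUSAL_LM"))

texts = ["...chunks of your domain documents go here..."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
    # mlm=False gives standard causal (next-token) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```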
I haven’t done either of those myself yet, so treat the sketch above as untested, sorry.
Hope someone is able to help.
Best,
M