LLM fine-tune with domain specific pdf documents

NajiAboo · June 7, 2023, 2:25am

We are looking to fine-tune an LLM. We have domain-specific PDF documents, and we need the fine-tuned model to answer questions based on them. We trained a GPT-2 model on PDF chunks, but it does not answer the questions. We also tried BLOOM 3B, which does not perform as expected either.
Any suggestions or support, please.

MichaelKoo · June 14, 2023, 12:05am

It is dependent on multiple factors.

It depends on the type of LLM you are using. Different models have different capabilities, such as text generation, text classification, or named entity recognition, so you need to make sure you choose a language model that can do what you need. For example, if you want a model similar to ChatGPT, you need one with text2text generation capabilities. If you want a question-answering model, you need to choose one built for that task.

Also, when using Transformers with a trained model, you need to configure the pipeline for the task you want. For example, a causal language model can answer multiple-choice questions only if it is configured correctly.

Also, for training purposes, PDFs cannot be used directly as input because models can only take raw text. So you first need to extract the text and get rid of useless information such as empty lines, extra spaces, bullet points, etc.
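The cleanup step described above can be sketched as follows. This is only a minimal illustration, assuming the text has already been extracted from the PDF by some other tool; the function name `clean_extracted_text` and the set of bullet characters are assumptions made for the example, not something from this thread.

```python
import re

# Leading bullet markers: common bullet glyphs, "-" or "*", or "1." / "1)" style numbering.
# This set is an assumption; adjust it to whatever your PDFs actually contain.
BULLET_RE = re.compile(r"^\s*(?:[•·▪●◦\-\*]|\d+[.)])\s+")


def clean_extracted_text(raw: str) -> str:
    """Normalize text that was already extracted from a PDF (extraction is not shown)."""
    cleaned_lines = []
    for line in raw.splitlines():
        line = BULLET_RE.sub("", line)            # drop a leading bullet marker
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of spaces and tabs
        if line:                                  # skip lines that are now empty
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    sample = "  • First   point\n\n\n- Second\tpoint  \n1. Third"
    print(clean_extracted_text(sample))
```

Real PDFs usually need more than this (headers, footers, hyphenated line breaks, tables), so treat it as a starting point rather than a complete cleaner.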

NajiAboo · June 14, 2023, 1:13am

Thanks for the reply. This really helps. The end goal of the fine-tuned model is to answer questions based on the documents it was trained on. As I understand it, if we go with a question-answering model, we need to have a set of questions in hand. As of now, we have only a limited question-and-answer set, and preparing a complete one would be costly and time-consuming. The approach we are currently planning is to train Bloom on the cleaned documents, which will give us a base model, and then fine-tune it again on the question-and-answer set that we have. I believe this will let the model learn the domain, so that it can answer questions that were not in the question-and-answer set.
Do you think this will work? Please advise. Once again, thanks for taking the time to reply.
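For the first stage of the plan above (training on the cleaned documents), the text has to be split into pieces that fit the model's context window. The sketch below shows one simple way to do that with overlapping word windows. The function name `chunk_words` and the default sizes are assumptions for illustration; a real pipeline would normally count tokens with the model's tokenizer rather than words.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # the window reached the end of the text
            break
    return chunks


if __name__ == "__main__":
    text = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(text, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.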

tsan1 · June 22, 2023, 12:33am

This is a very interesting question. Eagerly awaiting some reply.

Vikram9503 · July 26, 2023, 2:43pm

Any help on this? I am in a similar situation.

EquinoxElahin · September 13, 2023, 4:18pm

I’m very interested, @NajiAboo. Do you have any leads on answering your own question?

dbur · October 29, 2023, 12:36am

Any progress? Does fine-tuning add the PDF knowledge to the model? Does it work any better than prompt-based methods?

imvbhuvan · January 31, 2024, 5:57pm

We are looking to fine-tune a Mistral model on our PDF data; we have about 48 documents. We need to train the model using PEFT. We are unable to find proper resources for fine-tuning it and, more importantly, for how to prepare those documents, where to store them, and how to supply them to the training. Can anyone help us in this regard, please?

imvbhuvan · February 22, 2024, 5:08am

Hi @NajiAboo,
Could you please provide the code you used to train the LLM on the PDF documents? We want to do the same. We need to train a Mistral model on some 50 PDFs, but we are unable to find any resources, since all the blogs and videos use RAG and a vector database, which is not feasible in our case.

Please do help us in this regard, thank you.

gurusanand · March 2, 2024, 1:28pm

Hi @imvbhuvan, @NajiAboo
If you happen to have it, could you please provide the code you used to train the LLM on the PDF documents? We want to do the same. We need to train a Mistral model on some 50 PDFs, but we are unable to find any precise resources. Thank you.

Jims6367 · March 12, 2024, 11:29am

Is it possible to perform PEFT fine-tuning on PDF data, since it is unstructured?

br4tp1t · March 15, 2024, 7:20am

What I did to train my LLM on our documents: I used the GPT-4 API and wrote Python code to send the text of each document and ask GPT for 20 questions per document. The response was in JSON format, with INPUT (question + document text) and OUTPUT (the answer). Then I fine-tuned my model on this data, and it works pretty impressively. I also added a vector database where I store new documents, and it works very well even on documents that are not in the LLM.
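The data-generation step described above might look roughly like this. It is a sketch under stated assumptions, not the poster's code: `ask_llm` is a hypothetical callable standing in for the real GPT-4 API call (which is not shown), `fake_llm` is a stub so the example runs offline, and the prompt wording is invented. Only the INPUT/OUTPUT record layout comes from the post.

```python
import json
from typing import Callable


def build_records(doc_text: str, ask_llm: Callable[[str], str], n_questions: int = 20) -> list[dict]:
    """Ask an LLM for question-answer pairs about one document and turn them into training records."""
    prompt = (
        f"Write {n_questions} question-answer pairs about the document below. "
        'Reply with a JSON list of objects with the keys "question" and "answer".\n\n'
        + doc_text
    )
    pairs = json.loads(ask_llm(prompt))  # assumes the model really returned valid JSON
    # INPUT is the question plus the document text, OUTPUT is the answer, as in the post.
    return [{"INPUT": f"{p['question']}\n\n{doc_text}", "OUTPUT": p["answer"]} for p in pairs]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real API call so the sketch runs without network access."""
    return json.dumps([{"question": "What is covered?", "answer": "Pump maintenance."}])


if __name__ == "__main__":
    print(to_jsonl(build_records("Manual text.", fake_llm, n_questions=1)))
```

In practice the model's reply should be validated before use, since a malformed JSON response would make `json.loads` raise.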

saurabhksa · April 2, 2024, 9:06pm

I just used RAG with Mistral and it gives very nice results. No need to fine-tune (at least for me). And it's not just domain-specific: I tried mixed PDFs at once and the results are great.
Let me know if you would like to try out the code. Everything works locally, so you avoid sending data externally.
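To make the RAG idea concrete, here is a toy sketch of the retrieve-then-prompt flow. It is not the code referred to in this post. As a loudly labeled simplification, it ranks chunks by bag-of-words cosine similarity instead of using an embedding model and a vector database, and the names `retrieve` and `build_prompt` and the prompt wording are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (toy stand-in for a vector search)."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Put the retrieved chunks in front of the question; the result is sent to the LLM."""
    context = "\n---\n".join(retrieve(question, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "The pump must be serviced every six months.",
        "Invoices are paid within thirty days.",
        "Wear gloves when handling coolant.",
    ]
    print(build_prompt("How often must the pump be serviced?", docs, k=1))
```

The model itself is never retrained here; the documents only enter through the prompt, which is why new PDFs can be added without another training run.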

dijktsjv · April 3, 2024, 2:36pm

I would like to try the code! Sounds very interesting.

saurabhksa · April 4, 2024, 8:43am

You can try this one

It is just POC code. You may need to define HF_TOKEN as an environment variable if you don’t have the model available locally; the code will then download the necessary models.

swtb · April 9, 2024, 1:43pm

RAG + few-shot prompting with Mistral, for example.

bitschips · April 15, 2024, 12:54pm

It sounds nice, thanks.

dsbitcoding · April 19, 2024, 6:07am

Hi @imvbhuvan
I noticed your work with training the LLM model on PDF documents, and I’m interested in doing something similar. However, most of the resources I’ve found focus on using RAG and Vector database, which isn’t feasible for my project. Did you manage to find a solution to train the Mistral model on PDFs? If so, could you please share your code or any resources you found helpful? I’d greatly appreciate your assistance in this matter. Thank you in advance for your help!

SaikiranBondi · July 3, 2024, 10:09am

Hi Mr Bhuvan, I am looking at this specific use case: fine-tuning Llama 3 or Mistral, downloaded locally, on my PDF manuals. Can you please guide me if you have found a solution for this?
Best Regards
Saikiran B

Shubham-1991 · September 13, 2024, 4:49am

Using few-shot prompting + RAG, I got the best results and was able to remove the unwanted answers. I used a dataset of general-knowledge questions.
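A few-shot prompt on top of retrieved context could be assembled roughly as below. This is a hedged sketch, not the poster's setup: the function name `build_few_shot_prompt`, the instruction wording, and the example data are all assumptions. The retrieval step is assumed to have already produced `context`.

```python
def build_few_shot_prompt(question: str, context: str, examples: list[dict]) -> str:
    """Prepend worked examples so the model imitates their answer style, including refusals."""
    parts = ['Answer from the context. If the context does not contain the answer, say "I don\'t know."']
    for ex in examples:
        # Each example is a complete context/question/answer triple.
        parts.append(f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}")
    # The real question comes last, with the answer left open for the model to complete.
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [
        {"context": "Paris is the capital of France.", "question": "What is the capital of France?", "answer": "Paris."},
        {"context": "Paris is the capital of France.", "question": "Who wrote Hamlet?", "answer": "I don't know."},
    ]
    print(build_few_shot_prompt("What is the capital of Japan?", "Tokyo is the capital of Japan.", shots))
```

Including an example where the correct response is a refusal is one plausible way to discourage answers that are not supported by the retrieved text.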
