Fine-tuning an LLM with domain-specific PDF documents

We are looking to fine-tune an LLM. We have domain-specific PDF documents and need to fine-tune a model on them so that it can answer questions based on their content. We trained a GPT-2 model on chunks of the PDFs, but it does not answer the questions. We also tried BLOOM 3B, which does not perform as expected either.
Any suggestions or support, please?


It depends on multiple factors.

It depends on the type of LLM you are using. Different models have different capabilities, such as text generation, text classification, or named entity recognition, so you need to make sure you choose a language model that can do what you need. For example, if you want a model similar to ChatGPT, you need one with text2text generation capabilities. If you want a question-answering model, you need to choose a model built for that task.

Also, when using a trained model with transformers, you need to configure the pipeline for the task you want. For example, a causal language model can answer multiple-choice questions only if it is configured correctly.

Also, for training purposes, PDFs cannot be used directly as input because models only take raw text. So you first need to extract the text and strip out useless information such as empty lines, extra spaces, bullet points, etc.
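A minimal sketch of the cleanup step described above. It assumes the text has already been extracted from the PDF with a library of your choice; the function name `clean_pdf_text` and the set of bullet characters are illustrative assumptions, not part of any library.

```python
import re

# Leading bullet markers to strip. This set is an assumption; extend it
# to match whatever your PDFs actually contain.
BULLET_PATTERN = re.compile(r"^\s*[•\-\*▪◦·]\s+")


def clean_pdf_text(raw: str) -> str:
    """Drop empty lines, leading bullets, and repeated whitespace."""
    cleaned_lines = []
    for line in raw.splitlines():
        line = BULLET_PATTERN.sub("", line)         # remove a leading bullet
        line = re.sub(r"\s+", " ", line).strip()    # collapse spaces and tabs
        if line:                                    # skip empty lines
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    sample = "  • First   point\n\n\n- Second\tpoint  \nPlain  text"
    print(clean_pdf_text(sample))
```

Real PDFs usually also need page headers, footers, and hyphenated line breaks handled, which this sketch does not attempt.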


Thanks for the reply. This really helps. The end goal of the fine-tuned model is to answer questions based on the documents it was trained on. As I understand it, if we go with a question-answering model, we need to have a set of questions in hand. As of now, we have only a limited question-and-answer set, and preparing a complete one would be costly and time-consuming. Our current plan is to train Bloom on the cleaned documents, which will give us a base model, and then fine-tune that again on the question-and-answer set we have. I believe this will let the model learn the domain and answer questions that were not in the question-and-answer set.
Do you think this will work? Please advise. Once again, thanks for taking the time to reply.
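For illustration, the data preparation for the two stages in this plan could look roughly like the sketch below: overlapping word chunks for the first (domain) training pass, and a single text template for the question-and-answer pass. The helper names, the chunk sizes, and the "Question:/Answer:" template are assumptions made for the example; the thread does not say what was actually used.

```python
def chunk_words(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split cleaned text into overlapping word windows (stage 1 data)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):   # last window reached the end
            break
    return chunks


def format_qa(question: str, answer: str) -> str:
    """Render one question-answer pair as a training string (stage 2 data)."""
    return f"Question: {question}\nAnswer: {answer}"


if __name__ == "__main__":
    print(chunk_words("a b c d e f g", chunk_size=4, overlap=1))
    print(format_qa("What is covered?", "The pump motor."))
```

Splitting on words is a simplification; in practice the chunk length would be measured in tokens of the model's own tokenizer.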

This is a very interesting question. Eagerly awaiting some reply.

Any help on this? I am in a similar situation.

I’m very interested, @NajiAboo. Do you have any leads on answering your own question?

Any progress? Does fine-tuning add the PDF knowledge into the model? Does it work any better than prompt-based methods?

We are looking to fine-tune a Mistral model on our PDF data, about 48 documents. We need to train the model using PEFT. We are unable to find proper resources for fine-tuning it and, more importantly, for how to prepare those documents for fine-tuning: where to store them and how to supply them to the model. Can anyone help us with this, please?

Hi @NajiAboo
Could you please share the code you used to train the LLM on the PDF documents? We want to do the same. We need to train a Mistral model on some 50 PDFs, but we are unable to find any resources, since all the blogs and videos use RAG and a vector database, which is not feasible in our case.

Please do help us with this. Thank you.

Hi @imvbhuvan, @NajiAboo
If you happen to have it, could you please share the code you used to train the LLM on the PDF documents? We want to do the same. We need to train a Mistral model on some 50 PDFs, but we are unable to find any precise resources. Thank you.

Is it possible to perform PEFT fine-tuning on PDF data, given that it is unstructured?

What I did to train my LLM on our documents: I used the GPT-4 API and wrote Python code to send the text of each document and ask GPT for 20 questions per document. The response was in JSON format, with INPUT (question + document text) and OUTPUT (the answer). Then I fine-tuned my model on this data, and it works pretty impressively. I also added a vector database where I store new documents, and it works very well even on documents that are not in the LLM.
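A rough sketch of the record format described in this post, for anyone trying to reproduce it. It only covers turning already generated question-answer pairs into INPUT/OUTPUT records; the call to the GPT-4 API is left out. The `question` and `answer` keys, the function names, and the JSON Lines output are assumptions for illustration, not the poster's actual code.

```python
import json


def build_records(doc_text: str, qa_pairs: list[dict]) -> list[dict]:
    """Pair each generated question with the document text it came from.

    qa_pairs is assumed to look like [{"question": ..., "answer": ...}, ...],
    e.g. parsed from the model's JSON response.
    """
    return [
        {"INPUT": f"{pair['question']}\n\n{doc_text}", "OUTPUT": pair["answer"]}
        for pair in qa_pairs
    ]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    doc = "The pump must be serviced every six months."
    pairs = [{"question": "How often is the pump serviced?",
              "answer": "Every six months."}]
    print(to_jsonl(build_records(doc, pairs)))
```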

I just used RAG with Mistral and it gives very nice results. No need to fine-tune (at least for me). And it's not just domain-specific: I tried mixed PDFs at once and the results are great.
Let me know if you'd like to try out the code. Everything runs locally, to avoid sending data externally.
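To make the idea concrete, here is a toy sketch of the retrieval half of RAG. It is not the code offered in this post: it swaps the embedding model and vector store for a plain bag-of-words cosine similarity so that it runs with the standard library only, and every name and the prompt wording are assumptions. The resulting prompt would then be passed to a locally running model such as Mistral.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Put the retrieved chunks in front of the question."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


if __name__ == "__main__":
    docs = ["The pump must be serviced every six months.",
            "Invoices are paid within thirty days.",
            "The warranty covers the pump motor for two years."]
    print(build_prompt("How often must the pump be serviced?", docs, top_k=1))
```

A real setup would replace `cosine` over word counts with similarity over sentence embeddings, which handles paraphrased questions far better.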

I would like to try the code! Sounds very interesting.

You can try this one

It's just POC code. You may need to define HF_TOKEN as an environment variable if you don’t have the model available locally; the code will then download the necessary models.

RAG + few-shot prompting with Mistral, for example.
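One way to read this suggestion, as a hedged sketch: a few worked examples are placed before the retrieved context and the new question, so the model imitates the answer style. The function name, the dictionary keys, and the prompt layout are assumptions for illustration only.

```python
def few_shot_prompt(examples: list[dict], context: str, question: str) -> str:
    """Build a prompt with worked examples followed by the new question.

    Each example is assumed to be a dict with "context", "question",
    and "answer" keys.
    """
    parts = ["Answer the question using the context."]
    for ex in examples:
        parts.append(f"Context: {ex['context']}\n"
                     f"Question: {ex['question']}\n"
                     f"Answer: {ex['answer']}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [{"context": "Invoices are paid within thirty days.",
              "question": "When are invoices paid?",
              "answer": "Within thirty days."}]
    print(few_shot_prompt(shots,
                          "The pump must be serviced every six months.",
                          "How often is the pump serviced?"))
```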

It sounds nice, thanks.

Hi @imvbhuvan
I noticed your work on training an LLM on PDF documents, and I’m interested in doing something similar. However, most of the resources I’ve found focus on using RAG and a vector database, which isn’t feasible for my project. Did you manage to find a way to train the Mistral model on PDFs? If so, could you please share your code or any resources you found helpful? I’d greatly appreciate your assistance in this matter. Thank you in advance for your help!