Fine-tuning an LLM with domain-specific PDF documents

We are looking to fine-tune an LLM. We have domain-specific PDF documents, and we need to fine-tune a model on them so that it can answer questions based on their content. We trained a GPT-2 model on chunks of the PDFs, but it does not answer the questions. We also tried BLOOM 3B, which does not perform as expected either.
Any suggestions or support, please.


It depends on multiple factors.

It depends on the type of LLM you are using. Different models have different capabilities, such as text generation, text classification, or named entity recognition, so make sure you choose a language model that can do what you need. For example, if you want a model similar to ChatGPT, you need one with text-to-text generation capabilities. If you want a question-answering model, choose one built for that task.

Also, when using the transformers library with a trained model, you need to configure the pipeline for the task you want. For example, a causal language model can be used to answer multiple-choice questions, but only if it is configured correctly.
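
As a rough illustration of the pipeline point, here is a minimal sketch that assumes the Hugging Face transformers library and uses "gpt2" purely as a placeholder model name. The helper names and the prompt template are hypothetical, and a small base model will often answer poorly without fine-tuning.

```python
def build_multiple_choice_prompt(question: str, choices: list[str]) -> str:
    """Format a question and lettered options as a plain-text prompt."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {question}"]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_multiple_choice(question, choices, model_name="gpt2"):
    """Ask a causal LM to continue the prompt; returns only the new text."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import pipeline

    # "text-generation" is the pipeline task for causal language models.
    generator = pipeline("text-generation", model=model_name)
    prompt = build_multiple_choice_prompt(question, choices)
    output = generator(prompt, max_new_tokens=3, do_sample=False)
    # By default the pipeline returns the prompt plus the continuation,
    # so the prompt is sliced off here.
    return output[0]["generated_text"][len(prompt):].strip()
```

The task string passed to `pipeline` has to match the kind of model being loaded, which is what "configured correctly" comes down to in practice.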

Also, for training purposes, PDFs cannot be used directly as input because models only take raw text. So you first need to extract the text and get rid of useless information such as empty lines, extra spaces, bullet points, etc.
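
A minimal sketch of that cleanup step, assuming the text has already been extracted from the PDF into a Python string. The function name and the set of bullet characters are my own choices for illustration, not a complete list.

```python
import re

# Leading bullet markers to drop (an assumed, non-exhaustive set).
BULLET_RE = re.compile(r"^\s*[•▪◦●*\-]\s+")


def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF into compact plain lines."""
    cleaned_lines = []
    for line in raw.splitlines():
        line = BULLET_RE.sub("", line)            # strip a leading bullet marker
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if line:                                  # skip empty lines
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```

Real PDFs usually need more than this (page headers, hyphenated line breaks, tables), so treat it as a starting point.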

Thanks for the reply. This really helps. The end goal of the fine-tuned model is to answer questions based on the documents it was trained on. As I understand it, if we go with a question-answering model, we need to have a set of questions in hand. As of now, we have only a limited question-and-answer set, and preparing a complete one would be costly and time-consuming. The approach we are currently planning is to train BLOOM on the cleaned documents, which will give us a base model, and then fine-tune that model on the question-and-answer set we have. I believe this will let the model learn the domain, so that it can answer questions that were not in the question-and-answer set.
Do you think this will work? Please advise. Once again, thanks for taking the time to reply.
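
For the two-stage plan described above, the data for each stage could be prepared roughly as follows. This is only a sketch: the function names, the word-based chunking (a real setup would more likely chunk by tokenizer tokens), and the "Question:/Answer:" template are all assumptions.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split cleaned text into overlapping word windows for stage 1."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def format_qa_example(question: str, answer: str) -> str:
    """Render one Q&A pair as a single training string for stage 2."""
    return f"Question: {question.strip()}\nAnswer: {answer.strip()}"
```

Stage 1 would train on the output of `chunk_words`, and stage 2 on strings produced by `format_qa_example` from the existing question-and-answer set.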

This is a very interesting question. Eagerly awaiting some reply.

Any help on this? I am in a similar situation.

I’m very interested, @NajiAboo. Do you have any leads on answering your own question?