Train/finetune an LLM to answer a set of questions in unstructured PDFs

CharlesM · October 12, 2023, 1:38pm

Hi, I am having a hard time finding information on a specific task I want an open-source LLM to perform.
I have numerous unstructured PDFs (company annual financial reports, hundreds of pages each) and, for each one, the answers to the same set of questions (e.g. What is the percentage of women on the board? (in %), Is biodiversity mentioned in the report? (Yes/No)). The questions mostly call for a boolean or a number to retrieve. I want to train my model to answer these same questions more accurately for other unstructured PDFs. I have tried a basic LLM PDF chat app, but the results are really bad.

Even after browsing the internet for days, I can’t work out which solution best fits my problem or how to implement it.
Thanks in advance, any advice is welcome!

swtb · April 9, 2024, 1:46pm

Embed your data and use RAG (retrieval-augmented generation) to retrieve the documents relevant to the query, then add the contents of those documents to your prompt. Use a recent model such as Mistral/Mixtral. LangChain has a guide for RAG.
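
To make the retrieve-then-prompt idea concrete, here is a minimal sketch in plain Python. It is not a real pipeline: `embed` is a toy word-count stand-in for an actual embedding model, the sample chunks are made up for illustration, and all function names (`chunk_text`, `embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical. In practice you would extract text from the PDFs, embed the chunks with a real model, and send the final prompt to the LLM.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9%]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Put the retrieved chunks in front of the question."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say 'not found'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up sample chunks standing in for text extracted from a report.
chunks = [
    "The board of directors has ten members, of whom four are women.",
    "Revenue grew by 12 percent compared with the previous year.",
    "The company launched a biodiversity programme at two sites.",
]
question = "What is the percentage of women on the board?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)  # this string is what you send to the LLM
```

Since you ask the same fixed questions of every report, you can run this loop once per question per PDF and compare the model's answers with the ones you already have, which gives you a way to measure whether a change to chunk size, `k`, or the prompt wording actually helps.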
