There are 2 options here I’d say:
- either you fine-tune a text-only LLM (like Mistral, Llama, etc.) on the OCR text of the PDF along with a text prompt and corresponding targets. Refer to my notebook on fine-tuning Mistral-7B (or any other LLM) on a custom dataset: Transformers-Tutorials/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub. It leverages QLoRA with the PEFT library and is based on the official scripts from the Alignment Handbook: GitHub - huggingface/alignment-handbook: Robust recipes to align language models with human and AI preferences. Basically, one needs to prepare the dataset in the format the model expects (by calling `tokenizer.apply_chat_template` on the inputs). What I'd recommend is applying an OCR engine of choice to the PDF documents (such as Tesseract, or closed-source APIs like the ones from Google Cloud or Azure) to get the text; see the first sketch below this list for what that could look like.
- or you fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a certain desired target. One example would be taking in a PDF and producing corresponding JSON that contains the desired keys and values from the PDF. Refer to the example notebook regarding fine-tuning PaliGemma or the example notebook regarding fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you could easily fine-tune it on PDFs that consist of multiple pages.
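Concretely, the data preparation for the first option could look something like this. This is just a minimal sketch, assuming pytesseract and pdf2image are installed; the prompt, checkpoint and helper names are placeholders you'd adapt to your own data:

```python
# Sketch of option 1: OCR the PDF, then format (prompt, target) pairs with the
# model's chat template so they can be used for supervised fine-tuning.
from pdf2image import convert_from_path
import pytesseract
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def pdf_to_text(pdf_path: str) -> str:
    """Run OCR page by page and concatenate the results."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def build_training_example(pdf_path: str, target: str) -> str:
    """Turn a (PDF, target) pair into one chat-formatted training string."""
    ocr_text = pdf_to_text(pdf_path)
    messages = [
        {"role": "user", "content": f"Extract the relevant fields from this document:\n{ocr_text}"},
        {"role": "assistant", "content": target},
    ]
    # apply_chat_template inserts the special tokens the model was trained with
    return tokenizer.apply_chat_template(messages, tokenize=False)
```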
The latter approach basically boils down to: render each PDF page to an image and train the model to map (image, optional prompt) pairs directly to the desired output, with no separate OCR step in between.
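As an illustration, here's a minimal sketch of preparing a training batch for PaliGemma; the dataset fields, prompt and checkpoint are just examples, and the `suffix` argument is what makes the processor build the labels for you:

```python
# Sketch of option 2: one page image + prompt in, target JSON out.
from PIL import Image
from transformers import PaliGemmaProcessor

processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")

def collate_fn(examples):
    # One image per PDF page; the prompt asks for the fields, suffix is the target JSON
    images = [Image.open(ex["image_path"]).convert("RGB") for ex in examples]
    prompts = ["extract the invoice fields as JSON" for _ in examples]
    targets = [ex["target_json"] for ex in examples]
    batch = processor(text=prompts, images=images, suffix=targets,
                      padding="longest", return_tensors="pt")
    # batch contains input_ids, attention_mask, pixel_values and labels;
    # passing it to PaliGemmaForConditionalGeneration gives you a loss to backprop.
    return batch
```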
Update: more and more VLMs now support fine-tuning on multi-page documents. Some recent examples include Qwen2-VL and LLaVa-OneVision.
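For multi-page documents the idea is to pass one image per page within a single conversation. A rough sketch with Qwen2-VL (checkpoint and prompt are just examples, and this only shows building the inputs, not the label masking you'd add for training):

```python
# Sketch of feeding a multi-page PDF to Qwen2-VL: each page becomes one image
# entry in the user turn of the conversation.
from pdf2image import convert_from_path
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

pages = convert_from_path("document.pdf")  # list of PIL images, one per page
messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in pages]
        + [{"type": "text", "text": "Extract the key fields of this document as JSON."}],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=pages, return_tensors="pt")
# For fine-tuning you would instead append the assistant turn containing the
# target JSON and derive labels from input_ids, rather than adding a generation prompt.
```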