There are 2 options here I’d say:
- either you fine-tune a text-only LLM (like Mistral, Llama, etc.) on the OCR text of the PDF along with a text prompt and corresponding targets. Refer to my notebook on fine-tuning Mistral-7B (or any other LLM) on a custom dataset: Transformers-Tutorials/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub. It leverages QLoRA with the PEFT library and is based on the official scripts from the Alignment Handbook: GitHub - huggingface/alignment-handbook: Robust recipes to align language models with human and AI preferences. Basically, one needs to prepare the dataset in the format the model expects (by calling `tokenizer.apply_chat_template` on the inputs). What I'd recommend is applying an OCR engine of choice to the PDF documents (such as Tesseract, or closed-source APIs like the ones from Google Cloud or Azure) to get the text; see the first sketch below this list for what that could look like.
- or you fine-tune a vision-language model (like Idefics2, Llava, Llava-NeXT, PaliGemma) to take in the image(s) of the PDF, typically one image per page, plus an optional text prompt, and produce a certain desired target. One example would be taking in a PDF and producing corresponding JSON that contains the desired keys and values from the PDF. Refer to the example notebook regarding fine-tuning PaliGemma or the example notebook regarding fine-tuning Idefics2. The benefit of Idefics2 is that it encodes images very efficiently, which means you could easily fine-tune it on PDFs that consist of multiple pages.
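Concretely, the data preparation for the first option could look something like this. This is just a minimal sketch, assuming pytesseract and pdf2image are installed; the prompt, checkpoint and helper names are placeholders you'd adapt to your own data:

```python
# Sketch of option 1: OCR the PDF, then format (prompt, target) pairs with the
# model's chat template so they can be used for supervised fine-tuning.
from pdf2image import convert_from_path
import pytesseract
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def pdf_to_text(pdf_path: str) -> str:
    """Run OCR page by page and concatenate the results."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def build_training_example(pdf_path: str, target: str) -> str:
    """Turn a (PDF, target) pair into one chat-formatted training string."""
    ocr_text = pdf_to_text(pdf_path)
    messages = [
        {"role": "user", "content": f"Extract the relevant fields from this document:\n{ocr_text}"},
        {"role": "assistant", "content": target},
    ]
    # apply_chat_template inserts the special tokens the model was trained with
    return tokenizer.apply_chat_template(messages, tokenize=False)
```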
The latter approach basically boils down to: render each PDF page to an image and train the model to map (image, optional prompt) pairs directly to the desired output, with no separate OCR step in between.
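As an illustration, here's a minimal sketch of preparing a training batch for PaliGemma; the dataset fields, prompt and checkpoint are just examples, and the `suffix` argument is what makes the processor build the labels for you:

```python
# Sketch of option 2: one page image + prompt in, target JSON out.
from PIL import Image
from transformers import PaliGemmaProcessor

processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")

def collate_fn(examples):
    # One image per PDF page; the prompt asks for the fields, suffix is the target JSON
    images = [Image.open(ex["image_path"]).convert("RGB") for ex in examples]
    prompts = ["extract the invoice fields as JSON" for _ in examples]
    targets = [ex["target_json"] for ex in examples]
    batch = processor(text=prompts, images=images, suffix=targets,
                      padding="longest", return_tensors="pt")
    # batch contains input_ids, attention_mask, pixel_values and labels;
    # passing it to PaliGemmaForConditionalGeneration gives you a loss to backprop.
    return batch
```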
Update: more and more VLMs now support fine-tuning on multi-page documents. Some recent examples include Qwen2-VL and LLaVa-OneVision.
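For multi-page documents the idea is to pass one image per page within a single conversation. A rough sketch with Qwen2-VL (checkpoint and prompt are just examples, and this only shows building the inputs, not the label masking you'd add for training):

```python
# Sketch of feeding a multi-page PDF to Qwen2-VL: each page becomes one image
# entry in the user turn of the conversation.
from pdf2image import convert_from_path
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

pages = convert_from_path("document.pdf")  # list of PIL images, one per page
messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in pages]
        + [{"type": "text", "text": "Extract the key fields of this document as JSON."}],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=pages, return_tensors="pt")
# For fine-tuning you would instead append the assistant turn containing the
# target JSON and derive labels from input_ids, rather than adding a generation prompt.
```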