Question answering fine-tuning

Hello. I need to fine-tune a model to extract some fields from invoices (from the text extracted from the invoices). Some pretrained models, such as DistilBERT, require a dataset with the content, the questions, the answers, and the start_id of each answer. My dataset contains only the extracted text (the content) and the answers, but not the start_id. How can I overcome this limitation? Do you know of other approaches or models that require only questions and answers as the dataset?

There are multiple ways to solve this. As you're working with invoices, I'd expect a vision-language model to perform better than a text-only one.

See our blog post on document AI for an overview: Accelerating Document AI. Models like LayoutLM, which use layout information alongside the text, tend to outperform text-only models like DistilBERT on documents such as invoices.
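If you still want to try an extractive text-only model such as DistilBERT, the missing start index can often be recovered by searching for the answer string inside the content. Below is a minimal sketch, assuming each answer appears verbatim (up to letter case) in the extracted text. The function name `find_answer_start` is hypothetical, and the `text` / `answer_start` keys follow the SQuAD-style layout that extractive QA examples commonly use.

```python
import re


def find_answer_start(context: str, answer: str) -> dict:
    """Locate `answer` inside `context` and return a SQuAD-style answers dict.

    Assumption: the answer occurs verbatim in the context (case-insensitive).
    If it does not, empty lists are returned so the example can be skipped
    or inspected by hand.
    """
    # re.escape makes characters such as "." or "$" match literally.
    match = re.search(re.escape(answer), context, flags=re.IGNORECASE)
    if match is None:
        return {"text": [], "answer_start": []}
    # Use the matched span from the context so text and index stay aligned.
    return {"text": [match.group(0)], "answer_start": [match.start()]}


# Hypothetical invoice text, for illustration only.
context = "Invoice No: INV-001\nTotal: 120.50 EUR"
answers = find_answer_start(context, "120.50")
start = answers["answer_start"][0]
print(start, repr(context[start:start + len("120.50")]))
```

Only the first occurrence is returned, so values that repeat in an invoice (for example, the same amount as both subtotal and total) may need extra disambiguation. Answers that OCR has altered will not be found and come back as empty lists.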

Nowadays there are also many generative document AI models, including PaliGemma, Idefics2, and LLaVA, among others, besides Donut, Pix2Struct, and UDOP.

You can find demo notebooks for all of those here: GitHub - NielsRogge/Transformers-Tutorials: This repository contains demos I made with the Transformers library by HuggingFace.

Another option is to fine-tune a text-only LLM on OCR-ed text as I explained here: Fine tune LLMs on PDF Documents - #9 by nielsr
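For the text-only LLM route, no start index is needed: each training example just pairs the OCR-ed text with the fields to extract. Here is a rough sketch of how such pairs could be built. The helper `build_example`, the prompt wording, the `prompt` / `completion` keys, and the field names are all assumptions for illustration, not a format required by any particular library.

```python
import json


def build_example(ocr_text: str, fields: dict) -> dict:
    """Turn one invoice (OCR text + known answers) into a prompt/completion pair.

    Assumption: the model is trained to emit the fields as a JSON object.
    """
    field_names = ", ".join(fields)  # iterating a dict yields its keys
    prompt = (
        f"Extract the following fields from the invoice as JSON: {field_names}\n\n"
        f"{ocr_text}"
    )
    # ensure_ascii=False keeps accented characters readable in the target.
    completion = json.dumps(fields, ensure_ascii=False)
    return {"prompt": prompt, "completion": completion}


# Hypothetical invoice, for illustration only.
example = build_example(
    "Invoice No: INV-001\nTotal: 120.50 EUR",
    {"invoice_number": "INV-001", "total": "120.50"},
)
print(example["completion"])
```

Keeping the target as JSON makes the model output easy to parse and to compare against the known answers at evaluation time.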


Thank you very much for the answer! I will certainly have a look.