Question answering Fine tuning

Berny93 · May 31, 2024, 7:57am

Hello. I need to fine-tune a model to extract some fields from invoices (text extracted from the invoices). Some pretrained models such as DistilBERT require a dataset with the content, the questions, the answers, and the start_id of each answer. My dataset contains only the extracted text (the content) and the answers, but not the start_id. How can I overcome this limitation? Do you know other approaches or models that require only questions and answers as the dataset?

nielsr · May 31, 2024, 1:03pm

There are multiple ways to solve this. As you’re working with invoices, I’d assume a vision-language model to perform better than a text-only one.

See our blog post on document AI for an overview: Accelerating Document AI. Models like LayoutLM, which use layout information in addition to the text, tend to do better on documents than text-only models like DistilBERT.

Nowadays there are also a lot of generative document AI models including PaliGemma, Idefics2, LLaVa,… besides Donut, Pix2Struct, UDOP.

You can find demo notebooks for all of those here: GitHub - NielsRogge/Transformers-Tutorials: This repository contains demos I made with the Transformers library by HuggingFace.

Another option is to fine-tune a text-only LLM on OCR-ed text as I explained here: Fine tune LLMs on PDF Documents - #9 by nielsr
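On the missing start_id from the original question: if you do stay with an extractive model such as DistilBERT, the start index can often be recovered by searching for each answer string in the extracted text. Below is a minimal sketch, assuming the answer appears verbatim in the content. The helper name `add_answer_start` and the sample invoice text are made up for illustration; the output dict follows the SQuAD-style `text` / `answer_start` layout.

```python
from typing import Optional


def add_answer_start(context: str, answer: str) -> Optional[dict]:
    """Build a SQuAD-style answers dict by locating `answer` inside `context`.

    Hypothetical helper: returns None when the answer is not found verbatim,
    so those rows can be inspected or fixed by hand (OCR noise, different
    spacing, different number formatting, ...).
    """
    start = context.find(answer)  # character offset of the first occurrence, -1 if absent
    if start == -1:
        return None
    return {"text": [answer], "answer_start": [start]}


# Made-up invoice text, for illustration only
context = "Invoice No: INV-1042\nDate: 2024-05-31\nTotal: 118.50 EUR"

answers = add_answer_start(context, "INV-1042")
missing = add_answer_start(context, "ACME Corp")  # not in the text -> None

if answers is not None:
    start = answers["answer_start"][0]
    # The recovered offset should point back at the answer string
    print(start, repr(context[start:start + len("INV-1042")]))
print(missing)
```

Note that `str.find` returns only the first occurrence, so if a value appears several times in an invoice (for example the same amount as subtotal and total), the recovered position may not be the one you want and is worth checking.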

Berny93 · June 3, 2024, 10:20am

Thank you very much for the answer! I will surely have a look
