Can the Donut model be used to query multi-page documents?

SaiKirtana · June 21, 2023, 1:08pm

Hello,
I am working on a task to query PDF documents. Almost all of the pretrained models I came across are designed to query single-page documents. When I came across the Donut model, I used the pdf2image library to convert the PDF to images and passed them to the model. I tried this on documents of up to 5 pages, and the model did not throw any error.
My query here is:

Can Donut be used effectively to answer queries over a multi-page document in the manner described above, given that it was trained on single-page documents?
Is there any parameter I should be specifically aware of that could limit the model's performance when it is used for multi-page document querying?

Any guidance on how to proceed with this task is also welcome. Thank you in advance!

While I have yet to explore the Hi-VT5 model for multi-page documents, I am still interested in Donut since it is an OCR-free model.

import re

import torch
from pdf2image import convert_from_path
from transformers import DonutProcessor, VisionEncoderDecoderModel

def generateAnswerPDF(pdf_filepath, question):
    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
    
    # Render each PDF page to a PIL image (pdf2image needs poppler installed)
    images = convert_from_path(pdf_filepath)

    print(len(images))  # number of pages
    # Build the decoder prompt for the DocVQA task
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
    prompt = task_prompt.replace("{user_input}", question)
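    decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
    # Repeat the prompt once per page so the decoder batch matches the image batch
    decoder_input_ids = decoder_input_ids.repeat(len(images), 1)

    # The processor resizes every page to the model's fixed input resolution
    pixel_values = processor(images, return_tensors="pt").pixel_values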
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    # Decode every sequence in the batch, not only the first one
    for page_number, sequence in enumerate(processor.batch_decode(outputs.sequences), start=1):
        sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
        sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
        print(page_number, processor.token2json(sequence))

pdf_filepath = 'handbook.pdf' # PDF file path
question = "When do we need to adjust the temperature?"

generateAnswerPDF(pdf_filepath, question)
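
For anyone adapting the function above: one possible way to get a single answer for the whole document is to query each page separately and keep the answer with the highest confidence. The following is only a sketch of that aggregation step, not something confirmed to work well with Donut. The names `best_answer_across_pages` and `ask_page` are hypothetical. `ask_page` is assumed to wrap a single-page Donut call and to return an answer together with some confidence score (for example a mean token log-probability), which the code above does not compute.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical signature: takes one page image and the question, returns
# (answer text or None, confidence score). In practice this would wrap a
# single-page model.generate() call; it is injected here so that the
# aggregation logic does not depend on the model.
AskPage = Callable[[object, str], Tuple[Optional[str], float]]


def best_answer_across_pages(
    pages: Iterable[object],
    question: str,
    ask_page: AskPage,
) -> Optional[Tuple[int, str, float]]:
    """Query each page on its own and keep the highest-scoring answer.

    Returns (page_number, answer, score) with 1-based page numbers,
    or None if no page produced a non-empty answer.
    """
    best = None
    for page_number, page in enumerate(pages, start=1):
        answer, score = ask_page(page, question)
        if not answer:
            # Skip pages where the model returned nothing usable
            continue
        # Strict comparison: on a tie, the earlier page wins
        if best is None or score > best[2]:
            best = (page_number, answer, score)
    return best
```

Returning the page number alongside the answer makes it possible to check the result against the source page. Whether the chosen score is comparable across pages is an open assumption and would need to be validated on real documents.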

Anooja · October 17, 2023, 9:24am

Hello SaiKirtana,

I am also facing the same issue. Have you found a solution for this?
Please let me know at anujarjun11@gmail.com

thefaheem · February 20, 2024, 1:16pm

Any solutions?

riccardodemaria · January 29, 2025, 5:55am

I am also working on a similar project. What did you end up doing?
