Can the Donut model be used to query multi-page documents?

Hello,
I am working on a task to query PDF documents. Almost all of the trained models I came across are designed to query single-page documents. When I found the Donut model, I used the pdf2image library to convert the PDF pages to images and passed them to the model. I tried this with documents of up to 5 pages, and the model did not throw any error.
My questions are:

  1. Can Donut be used effectively to answer queries about a multi-page document in the manner described above, given that it was trained on single-page documents?
  2. Is there any parameter I should be specifically aware of that could limit the model's performance when it is used for multi-page document querying?
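
On the second question, one setting that may matter is the resolution at which the pages are rendered before Donut resizes them. Below is a minimal sketch of the arithmetic, assuming the processor scales each page image to fit a fixed input size. The 1920x2560 (width x height) value is an assumption for illustration only; the real value is in the checkpoint's preprocessor config. `rendered_size` and `fit_scale` are hypothetical helper names, and `dpi` mirrors the `dpi` argument of `convert_from_path` (200 by default).

```python
from typing import Tuple

# ASSUMPTION: illustrative model input size (width, height) in pixels.
# Check the checkpoint's preprocessor config for the actual value.
ASSUMED_INPUT_SIZE = (1920, 2560)


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel size of a page of the given physical size rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def fit_scale(page_px: Tuple[int, int], input_px: Tuple[int, int]) -> float:
    """Scale factor needed to fit the page inside the input size,
    keeping the aspect ratio. Values below 1 mean the page is shrunk."""
    page_w, page_h = page_px
    input_w, input_h = input_px
    return min(input_w / page_w, input_h / page_h)


# US Letter (8.5 x 11 inches) at the pdf2image default and at a higher dpi.
for dpi in (200, 600):
    size = rendered_size(8.5, 11, dpi)
    print(dpi, size, round(fit_scale(size, ASSUMED_INPUT_SIZE), 3))
```

A scale well below 1 means the page is shrunk before the model sees it, so small text may become harder to read. Each page in a batch is resized on its own, so the page count by itself should not change this factor.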

Any guidance on how to proceed with this task is also welcome. Thank you in advance!

While I have yet to explore the Hi-VT5 model for multi-page documents, I am still interested in Donut since it is an OCR-free model.

import re

import torch
from pdf2image import convert_from_path
from transformers import DonutProcessor, VisionEncoderDecoderModel

def generateAnswerPDF(pdf_filepath, question):
    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
    
    # Render every PDF page as a PIL image (one image per page)
    images = convert_from_path(pdf_filepath)

    print(len(images))  # number of pages
    # Prepare decoder inputs
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
    prompt = task_prompt.replace("{user_input}", question)
    decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

    pixel_values = processor(images, return_tensors="pt").pixel_values
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token

    print(processor.token2json(sequence))

pdf_filepath = 'handbook.pdf'  # PDF file path
question = "When do we need to adjust the temperature?"

generateAnswerPDF(pdf_filepath, question)
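
One possible way to approach this, since the checkpoint was trained on single pages, is to ask the question once per page and collect the answers instead of passing all pages as one batch with a single prompt. The following is only a sketch under that assumption, not a confirmed recipe for Donut. `build_prompt`, `query_each_page`, and `make_donut_asker` are hypothetical helper names; the Donut calls are the same ones used in the snippet above. The model is loaded lazily so the page loop can be tried with any stand-in function.

```python
import re
from typing import Any, Callable, List, Sequence, Tuple


def build_prompt(question: str) -> str:
    """DocVQA task prompt, same template as in the snippet above."""
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def query_each_page(
    pages: Sequence[Any],
    question: str,
    ask_page: Callable[[Any, str], str],
) -> List[Tuple[int, str]]:
    """Ask the question on every page; return (page_number, answer) pairs
    for pages that produced a non-empty answer. Pages are numbered from 1."""
    results = []
    for page_number, page in enumerate(pages, start=1):
        answer = (ask_page(page, question) or "").strip()
        if answer:
            results.append((page_number, answer))
    return results


def make_donut_asker(model_name: str = "naver-clova-ix/donut-base-finetuned-docvqa"):
    """Load Donut once and return a function that answers one page at a time."""
    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    def ask_page(image, question: str) -> str:
        tokenizer = processor.tokenizer
        prompt_ids = tokenizer(
            build_prompt(question), add_special_tokens=False, return_tensors="pt"
        ).input_ids
        pixel_values = processor(image, return_tensors="pt").pixel_values
        outputs = model.generate(
            pixel_values.to(device),
            decoder_input_ids=prompt_ids.to(device),
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )
        sequence = processor.batch_decode(outputs.sequences)[0]
        sequence = sequence.replace(tokenizer.eos_token, "").replace(tokenizer.pad_token, "")
        sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
        # ASSUMPTION: the parsed output is a dict with an "answer" key.
        return str(processor.token2json(sequence).get("answer", ""))

    return ask_page


# Usage (downloads the model, so it is left commented out):
# from pdf2image import convert_from_path
# pages = convert_from_path("handbook.pdf")
# print(query_each_page(pages, "When do we need to adjust the temperature?", make_donut_asker()))
```

Note that the model may produce some answer for every page, including pages that do not contain the information, so the per-page answers would still need to be compared or filtered by some other signal.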

Hello SaiKirtana,

I am facing the same issue. Have you found a solution?
Please let me know at anujarjun11@gmail.com

Any solutions?