Donut fine tuning question

Hi,

I followed the setup in "Document AI: Fine-tuning Donut for document-parsing using Hugging Face Transformers" (philschmid.de) to fine-tune Donut on a custom dataset. My input is a CSV file that looks like this:

image_path,ground_truth
AccessCode.png,"{""gt_parse"":{""roll_number"":""050 065 14020 0000"",""tax_year"":""REPRINT-2014"",""tax_amount"":""7784"",""tax_due_date"":""\u2018Aug. 14, 2014"",""property_address"":""1234 FRANCIS ST PLAN 1274 LOT 15 NRSFR"",""municipality"":""City of CAMBRIDGE"",""borrower_name"":""DOE JOHN DOE JANE""}}"
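
For anyone checking the data format, here is a minimal sketch of how a row like this could be read and flattened into the tagged target string that Donut-style training typically uses. It assumes standard CSV quoting (a doubled quote escapes a quote) and uses only two fields from the row to keep it short. The helper name `to_tagged_sequence` is hypothetical and is a simplified stand-in for the tutorial's conversion step, not a copy of it.

```python
import csv
import io
import json

# Two fields from the row above, with standard CSV quoting ("" escapes a quote).
csv_text = (
    "image_path,ground_truth\n"
    'AccessCode.png,"{""gt_parse"":{""roll_number"":""050 065 14020 0000"",'
    '""tax_amount"":""7784""}}"\n'
)


def to_tagged_sequence(obj):
    """Hypothetical helper: flatten a dict/list into <s_key>value</s_key> tags."""
    if isinstance(obj, dict):
        return "".join(
            f"<s_{key}>{to_tagged_sequence(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        return "<sep/>".join(to_tagged_sequence(item) for item in obj)
    return str(obj)


rows = list(csv.DictReader(io.StringIO(csv_text)))
gt_parse = json.loads(rows[0]["ground_truth"])["gt_parse"]
target = to_tagged_sequence(gt_parse)
print(target)
```

If `json.loads` fails on a row, the quoting in the CSV file is the first thing to check, since the model never sees a row that cannot be parsed.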

I managed to train the model, but as a trial run I used only 9 training rows. When I run the code below to test it, I only get an empty response back. Is this because the training dataset is so small?

from transformers import DonutProcessor, VisionEncoderDecoderModel
import re
import torch
from PIL import Image

model_path = 'C:/ocr/model_trained_donut'
fileName = 'C:/ocr/AccessCode.png'

# Load the processor and the fine-tuned model from the same directory.
processor = DonutProcessor.from_pretrained(model_path)
model = VisionEncoderDecoderModel.from_pretrained(model_path)

image = Image.open(fileName).convert("RGB")  # PNG files can be RGBA or palette-based; use 3-channel RGB
pixel_values = processor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)

task_prompt = "<s>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    early_stopping=False,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
    output_scores=True,
)
							   

sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
print(sequence)

print(processor.token2json(sequence))

Output:

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model loaded
C:\Users\thaban.segaran\AppData\Roaming\Python\Python39\site-packages\transformers\generation\utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
C:\Users\thaban.segaran\AppData\Roaming\Python\Python39\site-packages\transformers\generation\configuration_utils.py:399: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
  warnings.warn(
Reference:

Prediction:
 {'text_sequence': '<s></s>'}
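
To narrow down where the empty result comes from, the string post-processing from the script can be traced on the reported prediction without loading the model. This is only a sketch: the helper name `clean_sequence` is hypothetical, and it assumes the tokenizer's end-of-sequence and padding tokens are the strings `</s>` and `<pad>`. The non-empty input is made up for comparison and reuses a field name and value from the CSV row.

```python
import re


def clean_sequence(sequence, eos_token="</s>", pad_token="<pad>"):
    """Mirror the string post-processing from the script, without a model."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    # Drop only the first tag, which is the task start token.
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


# Raw string reported under "Prediction:" in the output above.
empty_case = clean_sequence("<s></s>")

# Hypothetical non-empty output, for comparison only.
filled_case = clean_sequence("<s><s_tax_amount>7784</s_tax_amount></s>")

print(repr(empty_case))
print(filled_case)
```

Under these assumptions the reported prediction reduces to an empty string, which suggests the decoder produced the end-of-sequence token directly after the start token, so there is nothing left for `token2json` to parse.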