Creating a DocVQA dataset: gt_parses

@nielsr I’m currently working through your Donut VQA fine-tuning experiment, and I’m confused about the gt_parses column.

  1. What is the correct format when there are multiple questions for each answer? Your tutorial shows how to format multiple answers for one question, but not the converse.

  2. I noticed that in your docvqa_1200_examples_donut dataset, most rows have only one ground truth in the gt_parses list. Moreover, the rows that do have more than one ground truth just repeat the same question and answer. Is this by design, or is something wrong?

Thanks!

Hi,

The Donut authors used gt_parses (rather than gt_parse) only because images in the DocVQA dataset have multiple annotations: several human annotators were used to create question-answer pairs for each image.
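To make the structure concrete, here is a minimal sketch of how one DocVQA question with several annotated answers might be turned into a gt_parses entry. It follows my understanding of the format described in the Donut README, where each answer candidate becomes its own {"question", "answer"} dict, so the question text repeats once per candidate. The function name build_ground_truth and the (question, answers) input shape are hypothetical, and the sketch assumes one row per question.

```python
import json


def build_ground_truth(question: str, answers: list[str]) -> str:
    """Build a Donut-style ground_truth JSON string for one DocVQA question.

    Hypothetical helper: each answer candidate becomes a separate
    {"question", "answer"} entry, so the question repeats per candidate.
    """
    gt_parses = [{"question": question, "answer": answer} for answer in answers]
    return json.dumps({"gt_parses": gt_parses})


# Example: one question that two annotators answered differently.
print(build_ground_truth("What is the date?", ["May 5, 1998", "5/5/98"]))
```

With this layout, a row may legitimately repeat the same question with different answers. Exact duplicate (question, answer) pairs add nothing.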

As for rows that repeat the same question and answer, that is likely an issue with this specific dataset. It definitely shouldn’t be like that.
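If you want to clean such rows yourself, a minimal sketch is to drop exact-duplicate (question, answer) pairs while keeping the first occurrence and the original order. This assumes the ground truth is stored as a JSON string with a top-level "gt_parses" list. The helper name dedupe_gt_parses is hypothetical and not part of Donut or Transformers.

```python
import json


def dedupe_gt_parses(ground_truth: str) -> str:
    """Remove exact-duplicate {"question", "answer"} entries from gt_parses.

    Hypothetical helper: keeps the first occurrence of each pair and
    preserves the original order of the remaining entries.
    """
    data = json.loads(ground_truth)
    seen = set()
    unique = []
    for parse in data["gt_parses"]:
        key = (parse.get("question"), parse.get("answer"))
        if key not in seen:
            seen.add(key)
            unique.append(parse)
    data["gt_parses"] = unique
    return json.dumps(data)
```

With a Hugging Face `datasets` dataset, you could apply this through `Dataset.map`, for example `dataset.map(lambda ex: {"ground_truth": dedupe_gt_parses(ex["ground_truth"])})`. That call assumes the column is named "ground_truth", so check the actual column name in docvqa_1200_examples_donut first.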