Hi all,
Can someone share fine-tuning tips for Pix2Struct, which takes multimodal (image + text) inputs and produces output through a text decoder? I'm particularly interested in fine-tuning a Pix2Struct model on a DocVQA-style dataset.
Thanks in advance for the help.