Fine-tuning tips for Pix2Struct on DocVQA

Hi all,

Can someone share fine-tuning tips for Pix2Struct, where the input is multimodal (document image plus question text) and the output comes from the text decoder? I'm particularly interested in fine-tuning the Pix2Struct model on a DocVQA-style dataset.
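
For context, here is a minimal sketch of the single-example training step I have in mind, using the Transformers Pix2StructProcessor and Pix2StructForConditionalGeneration classes; the checkpoint name, image path, question, answer, and learning rate below are just placeholders, not a recommended recipe:

```python
# Minimal sketch of one training step, assuming the Transformers Pix2Struct classes.
# Checkpoint, image path, question, answer and learning rate are placeholders.
import torch
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

checkpoint = "google/pix2struct-docvqa-base"  # placeholder: any Pix2Struct VQA checkpoint
processor = Pix2StructProcessor.from_pretrained(checkpoint)
model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)
model.train()

image = Image.open("document_page.png").convert("RGB")  # placeholder document image
question = "What is the invoice date?"                  # placeholder question
answer = "21 March 2022"                                # placeholder answer

# For the DocVQA-style checkpoints, the processor renders the question as a header
# on top of the document image and returns flattened patches + attention mask.
inputs = processor(images=image, text=question, return_tensors="pt")

# The answer is tokenized and used as decoder labels (in a batched setting,
# I assume padding token ids in the labels should be replaced with -100).
labels = processor.tokenizer(answer, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder learning rate

outputs = model(**inputs, labels=labels)  # labels are shifted internally for the decoder
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

If this setup is roughly right, any advice on hyperparameters and preprocessing choices (learning rate, batch size, the processor's max_patches setting) would be much appreciated.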
Thanks in advance for the help.