I want to fine-tune ViLT (Vision-and-Language Transformer) for my task. In my dataset, each example has 10 images paired with 1 text. For ViltForImagesAndTextClassification, I can increase the number of images through ViltConfig, but I am not able to preprocess the dataset with ViltProcessor inside a DataLoader.
Is it possible to pass images and text to ViltProcessor as a batch? If so, could anyone show me how to do that?
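For concreteness, here is a minimal sketch of the grouping step I have in mind. It is plain Python with no ViLT calls: each example holds 10 images and 1 text, and a collate-style function flattens the images of a batch into one list so that they could be handed to a processor in a single call. The keys "images" and "text", the name collate_examples, and the string stand-ins for images are placeholders I made up for illustration, and I am not sure this is the intended way to feed ViltProcessor.

```python
from typing import Dict, List, Sequence

NUM_IMAGES = 10  # images per example in my dataset (assumption for this sketch)


def collate_examples(batch: Sequence[Dict]) -> Dict:
    """Flatten the images of a batch and collect the texts.

    Each example is a dict with the hypothetical keys "images"
    (a list of NUM_IMAGES images) and "text" (one string).
    """
    texts: List[str] = []
    flat_images: List = []
    for example in batch:
        if len(example["images"]) != NUM_IMAGES:
            raise ValueError("every example must have exactly NUM_IMAGES images")
        texts.append(example["text"])
        flat_images.extend(example["images"])
    # flat_images now has len(batch) * NUM_IMAGES entries, ordered example by
    # example, so processed images could later be regrouped per example.
    return {"texts": texts, "flat_images": flat_images}


# Stand-in data: strings instead of real images.
batch = [
    {"images": [f"ex{i}_img{j}" for j in range(NUM_IMAGES)], "text": f"caption {i}"}
    for i in range(2)
]
out = collate_examples(batch)
print(len(out["texts"]), len(out["flat_images"]))
```

What I do not know is how to go from the flat image list and the texts to model inputs whose image tensor is grouped per example, which is the part I am asking about.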
Thanks in advance.