Questions regarding the multi-modal setup on the MM-IMDb dataset

Hi all, thank you so much for the wonderful work.

I have a couple of questions about the training details for the MM-IMDb dataset.

  1. Are the image encoder and the tokenizer embeddings fine-tuned during training on the MM-IMDb dataset? If not, can you suggest a way to do this, or point me to any relevant material? (A rough sketch of what I have in mind is below.)

  2. Is there a way to modify the code so that the model's pre-trained weights can be used for sequence-to-sequence generation tasks instead of classification? (See the second sketch below.)

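To make question 1 concrete, here is roughly what I have in mind, in PyTorch. The attribute names `image_encoder` and `text_embeddings` are placeholders for whatever the actual modules are called in this codebase, and the learning rate is just a guess:

```python
import torch

def unfreeze_encoders(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Freeze everything first ...
    for p in model.parameters():
        p.requires_grad = False
    # ... then unfreeze only the parts I want to fine-tune.
    # `image_encoder` / `text_embeddings` are placeholder names.
    for module in (model.image_encoder, model.text_embeddings):
        for p in module.parameters():
            p.requires_grad = True
    # Optimize only the trainable parameters, with a small learning rate
    # so the pre-trained weights are not destroyed early in training.
    trainable = (p for p in model.parameters() if p.requires_grad)
    return torch.optim.AdamW(trainable, lr=1e-5)
```

Is that compatible with how the training loop builds its optimizer, or is there a recommended place to hook this in?
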
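And for question 2, a minimal sketch of the kind of change I am imagining: keep the pre-trained encoder and replace the classification head with an autoregressive decoder. All module names and shapes here are my assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class Seq2SeqWrapper(nn.Module):
    """Wrap a pre-trained encoder for generation instead of classification.

    Assumes the encoder returns hidden states of shape (batch, seq, hidden);
    `hidden_size` and `vocab_size` would come from the repo's config.
    """

    def __init__(self, pretrained_encoder: nn.Module,
                 hidden_size: int, vocab_size: int):
        super().__init__()
        self.encoder = pretrained_encoder  # frozen or fine-tuned, per Q1
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.tgt_embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_inputs: dict,
                tgt_ids: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(**encoder_inputs)  # (B, S, H) assumed
        tgt = self.tgt_embed(tgt_ids)
        # Causal mask so each target position only attends to its past.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(hidden)  # logits over the vocabulary
```

Training would then be standard teacher forcing with a cross-entropy loss over the shifted target tokens. Does something like this fit with how the pre-trained weights are organized, or would parts of the checkpoint not map cleanly onto the encoder?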
Any suggestions or comments would be of great help.

Thank you!