Our team is building an image captioning model that can generate a movie/TV show plot from its poster.
The goal of this project is to create an image captioning model using a transformer vision encoder such as the Vision Transformer (ViT) and a transformer decoder language model such as GPT-2.
Any vision-based encoder paired with a language-model decoder would be a good candidate for training a VisionEncoderDecoderModel for image captioning; we are starting with ViT as the encoder and GPT-2 as the decoder.
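A minimal sketch of how the encoder and decoder might be paired with Hugging Face's VisionEncoderDecoderModel. The specific checkpoints (`google/vit-base-patch16-224-in21k`, `gpt2`) are assumed starting points, not final choices:

```python
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Assumed checkpoints for this project; any ViT-style encoder and
# GPT-style decoder on the Hub could be substituted.
encoder_ckpt = "google/vit-base-patch16-224-in21k"
decoder_ckpt = "gpt2"

# Stitch the pretrained encoder and decoder into one seq2seq model;
# cross-attention layers in the decoder are initialized fresh.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)
processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# GPT-2 has no pad token, so reuse EOS for padding and generation.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```

At inference time, a poster image would be preprocessed with `processor(images=..., return_tensors="pt").pixel_values` and passed to `model.generate(...)`, with the output ids decoded by the tokenizer.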
We are using publicly available IMDb datasets to train the model.
The main challenge is building a good dataset of posters paired with movie plots. It will also be interesting to see whether the model gives good predictions for non-English movies/TV shows.
We will create a Streamlit or Gradio app on Spaces that can predict a movie/TV show plot from its poster.