Poster2Plot: Generate Movie/T.V show plot from poster

dsr · November 21, 2021, 9:44pm

Description

Our team is building an image captioning model that can generate a movie/TV show plot from its poster.

The goal of this project is to create an image captioning model using a transformer encoder like Vision Transformer (ViT) and a transformer decoder language model like GPT-2.

Model(s)

Any vision-based encoder and language model decoder would be a good candidate for training a VisionEncoderDecoderModel for image captioning. We are trying the following models first:

Encoder: google/vit-base-patch16-224-in21k
Decoder: gpt2
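
For readers who want to try the same pairing, here is a minimal sketch of how the two checkpoints above could be combined with the `VisionEncoderDecoderModel` class from Transformers. It is not the exact code we trained with. The function name `build_poster2plot_model` is made up for illustration, reusing the EOS token as the pad token is an assumption (GPT-2 ships without one), and `ViTImageProcessor` is the class name in recent Transformers releases (older releases call it `ViTFeatureExtractor`).

```python
ENCODER_CHECKPOINT = "google/vit-base-patch16-224-in21k"
DECODER_CHECKPOINT = "gpt2"


def build_poster2plot_model(encoder_name=ENCODER_CHECKPOINT,
                            decoder_name=DECODER_CHECKPOINT):
    """Pair a pretrained vision encoder with a pretrained language-model decoder.

    Hypothetical helper: downloads both checkpoints when called.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    # The decoder's cross-attention layers are newly initialized here,
    # so the combined model needs fine-tuning before it is useful.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )
    image_processor = ViTImageProcessor.from_pretrained(encoder_name)
    tokenizer = AutoTokenizer.from_pretrained(decoder_name)

    # Assumption: GPT-2 has no pad token, so reuse EOS for padding.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    return model, image_processor, tokenizer
```

Calling `build_poster2plot_model()` returns the model, the image processor and the tokenizer, which can then be passed to a training loop or to `Seq2SeqTrainer`.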

Datasets

We are using publicly available IMDb datasets to train the model.
Some examples:

Challenges

The main challenge is to create a good dataset of posters paired with movie plots. It will also be interesting to see whether the model gives good predictions for non-English movies/TV shows.
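
To make the dataset challenge concrete, here is a rough sketch of the kind of cleaning step that could turn raw records into poster/plot pairs. Everything in it is an assumption for illustration: the field names `poster_path` and `plot`, the helper name `clean_pairs`, and the minimum plot length are hypothetical and do not describe the actual IMDb files.

```python
def clean_pairs(records, min_plot_words=5):
    """Keep records that have both a poster path and a usable plot.

    Hypothetical schema: each record is a dict with "poster_path" and "plot".
    """
    seen_posters = set()
    pairs = []
    for record in records:
        poster = (record.get("poster_path") or "").strip()
        # Collapse runs of whitespace so plots are single-spaced.
        plot = " ".join((record.get("plot") or "").split())

        # Drop entries with no poster or with a plot too short to learn from.
        if not poster or len(plot.split()) < min_plot_words:
            continue
        # Drop repeated posters so the same image is not seen twice.
        if poster in seen_posters:
            continue

        seen_posters.add(poster)
        pairs.append({"poster_path": poster, "plot": plot})
    return pairs
```

The threshold of five words is arbitrary; the right cut-off depends on how noisy the collected plots turn out to be.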

Desired project outcomes

We will create a Streamlit or Gradio app on Spaces that can predict a movie/TV show plot from its poster.
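
As a rough idea of what such an app could look like with Gradio, here is a minimal sketch. It is an illustration under assumptions, not the final app: `MODEL_ID` is a placeholder to be replaced with a real Hub repository, the helper names are made up, the generation settings (`max_length`, `num_beams`) are arbitrary, and it assumes the repository also stores the image processor and tokenizer files.

```python
MODEL_ID = "your-username/poster2plot"  # placeholder, not a real repository id


def load_pipeline(model_id=MODEL_ID):
    """Load the fine-tuned model, image processor and tokenizer from the Hub."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    image_processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, image_processor, tokenizer


def generate_plot(image, model, image_processor, tokenizer, max_length=128):
    """Turn a PIL poster image into a generated plot string."""
    # Posters may be RGBA or grayscale; ViT expects three channels.
    inputs = image_processor(images=image.convert("RGB"), return_tensors="pt")
    output_ids = model.generate(
        inputs.pixel_values, max_length=max_length, num_beams=4
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()


def main():
    """Launch a small Gradio demo: upload a poster, read the generated plot."""
    import gradio as gr

    model, image_processor, tokenizer = load_pipeline()
    demo = gr.Interface(
        fn=lambda image: generate_plot(image, model, image_processor, tokenizer),
        inputs=gr.Image(type="pil"),
        outputs="text",
        title="Poster2Plot",
    )
    demo.launch()
```

Calling `main()` starts the demo locally. On Spaces the same code would typically live in `app.py` with `main()` called at the bottom of the file.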

dk-crazydiv · November 22, 2021, 1:21am

Let’s give it a try.

dsr · November 22, 2021, 7:10pm

@dk-crazydiv and I were able to train a VisionEncoderDecoderModel to generate a movie/TV show plot from a poster. We used the google/vit-base-patch16-224-in21k encoder and the gpt2 decoder.

We have uploaded the model to the Model Hub: poster2plot

@lewtun Link to the Gradio app on Spaces poster2plot

We are still working on improving the model.

lewtun · November 23, 2021, 2:50pm

Wow, this is an incredibly cool project and Space that you’ve created - great job! Thank you for taking part in the course event.

nielsr · November 23, 2021, 2:56pm

Awesome work @dsr and team!

dk-crazydiv · November 24, 2021, 3:38am

Thank you. It was indeed very fun for us to build and super fun to manually test as well. The course content, along with some existing code snippets, helped a lot. The ease with which one can build and demo projects like these in the HF ecosystem cannot be praised enough. The ecosystem cuts the idea-to-deployment time by at least 10x. With the Hub, Datasets, the abstractions in the Transformers library, and Spaces, we were able to do in a day what would probably have taken weeks.
