CLIP-like contrastive vision-language models for German with pre-trained text and vision models

For this project, a pre-trained image model such as ViT and a pre-trained text model such as BERT can be used as the image encoder and text encoder, respectively.

Model

Pre-trained ViT and BERT models can be found on the model hub. Multilingual BERT/RoBERTa models could also be used for the German language.

Datasets

The WIT dataset can be used for this task.

Available training scripts

A training script for this will be provided soon. (see PR)

(Optional) Desired project outcome

The desired outcome is to train a CLIP-like model for the German language. This can be showcased with a Streamlit or Gradio app.

(Optional) Challenges

This project will require some modifications to the existing models. Specifically, projection layers need to be added on top of both the text and image encoders so that their outputs land in a shared embedding space.
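As a rough illustration of those modifications, the sketch below uses plain numpy to stand in for the encoders: random vectors play the role of BERT's and ViT's pooled outputs, a linear projection per modality maps them into a shared space, and a CLIP-style symmetric contrastive loss is computed over the batch. All dimensions and the temperature value are illustrative assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 768-d encoder outputs, 512-d shared space, batch of 4.
text_dim, image_dim, proj_dim, batch = 768, 768, 512, 4

# Stand-ins for the pooled outputs of the pre-trained text and image encoders.
text_features = rng.standard_normal((batch, text_dim))
image_features = rng.standard_normal((batch, image_dim))

# The projection layers mentioned in the post: one linear map per modality.
W_text = rng.standard_normal((text_dim, proj_dim)) * 0.02
W_image = rng.standard_normal((image_dim, proj_dim)) * 0.02

def l2_normalize(x):
    """Unit-normalize each row so similarity is a cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

text_emb = l2_normalize(text_features @ W_text)
image_emb = l2_normalize(image_features @ W_image)

# Matching image/caption pairs sit on the diagonal of the similarity matrix.
temperature = 0.07  # assumed value; CLIP learns this as a parameter
logits = (image_emb @ text_emb.T) / temperature

def cross_entropy(logits, targets):
    """Mean cross-entropy with integer targets, computed stably in log space."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

labels = np.arange(batch)  # the i-th image matches the i-th caption
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
print(float(loss))
```

In an actual Flax implementation the two `W` matrices would be trainable `Dense` layers applied to the real encoder outputs, and the loss above would be minimized over batches of German image-caption pairs.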

(Optional) Links to read upon

It’s quite an interesting project, count me in :grinning:

Hi guys!

Sounds like an interesting topic to get started with Jax/Flax and CLIP.
My main background is from recommender systems and NLP but I am interested in overlaps with CV, which makes this a great place to start!

Count me in! :slight_smile:

Great job defining this project! cc @valhalla

This project sounds interesting! I would love to join it. I have experience with GANs, computer vision, and PyTorch Lightning, and I am studying German (B1) at the same time. It would be a great learning experience for me. Please let me know how I can join.

Where can I find the project on Discord? What’s the channel name?