CLIP-like contrastive vision-language models for German with pre-trained text and vision models
For this project, a pre-trained image model such as ViT and a pre-trained text model such as BERT can be used as the image encoder and text encoder, respectively.
Pre-trained ViT and BERT models can be found on the model hub. For German, multilingual BERT/RoBERTa models could also be used.
The WIT dataset can be used for this task.
Available training scripts
A training script for this will be provided soon. (see PR)
(Optional) Desired project outcome
The desired outcome is to train a CLIP-like model for the German language. This can be showcased with a Streamlit or Gradio app.
This model will require some modifications to the existing models. Specifically, we will need to add projection layers on top of both the text and image encoders so that their outputs can be mapped into a shared embedding space.
(Optional) Links to read up on
It’s quite an interesting project, count me in!
I would be happy to join this project. I’m super excited about JAX and have a little experience with it. My main background is in computer vision with TensorFlow, and I have some experience with CLIP-like architectures and zero-shot classification.
Hope I can add something to the project!
Sounds like an interesting topic to get started with JAX/Flax and CLIP.
My main background is in recommender systems and NLP, but I am interested in overlaps with CV, which makes this a great place to start!
Count me in!
Hi! The project sounds great. I already implemented a CLIP-like model with PyTorch Lightning / timm and Transformers (Universal Sentence Encoder and RoBERTa). Looking forward to working with CLIP, JAX/Flax, and TPUs. So count me in!
Great defining this project! cc @valhalla
This project sounds interesting! I would love to join it. I have experience with GANs, computer vision, and PyTorch Lightning, and I am currently studying German (B1). It would be a great learning experience for me. Please let me know how I can join.
Where can I find the project on Discord? What’s the channel name?