CLIP-like contrastive vision-language models for Indonesian with pre-trained text and vision models
For this project, a pre-trained image model like ViT and a pre-trained text model like BERT can be used as the image encoder and the text encoder, respectively.
Model
Pre-trained ViT and BERT models can be found on the Hugging Face model hub. Multilingual BERT/RoBERTa models could also be used for the Indonesian language.
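As a sketch of the starting point, both encoders can be loaded straight from the hub; the checkpoint names below are example choices, not fixed requirements of the project.

```python
from transformers import AutoTokenizer, BertModel, ViTFeatureExtractor, ViTModel

# Example checkpoints only -- any ViT checkpoint and any Indonesian or
# multilingual BERT/RoBERTa checkpoint from the hub should work here.
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
text_encoder = BertModel.from_pretrained("indobenchmark/indobert-base-p1")
# Multilingual alternative: BertModel.from_pretrained("bert-base-multilingual-cased")

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# The hidden sizes matter later, when projection layers are added (see Challenges).
print(image_encoder.config.hidden_size, text_encoder.config.hidden_size)
```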
Datasets
The WIT (Wikipedia-based Image Text) dataset can be used for this task.
The GEM dataset can also be used for the task.
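As a minimal sketch, WIT could be loaded with the datasets library and filtered down to Indonesian examples. The hub identifier and the language column name below are assumptions and should be checked against the dataset card of whichever WIT copy is used.

```python
from datasets import load_dataset

# "google/wit" and the "language" column are assumptions -- verify against the
# dataset card; "wikimedia/wit_base" is another hub copy worth considering.
# Streaming avoids downloading the full multilingual corpus up front.
wit = load_dataset("google/wit", split="train", streaming=True)
wit_id = wit.filter(lambda example: example["language"] == "id")

for example in wit_id.take(3):
    print(example)
```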
Available training scripts
A training script for this will be provided soon (see PR).
(Optional) Desired project outcome
The desired outcome is to train a CLIP-like model for the Indonesian language. This can be showcased with a Streamlit or Gradio app.
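As a rough sketch of such a demo, a single Gradio interface scoring an image against an Indonesian caption would do. Here `model`, `tokenizer`, and `feature_extractor` are assumed to come out of training, and `embed` refers to the hypothetical helper in the IndoCLIP sketch under Challenges below.

```python
import gradio as gr
import torch

# `model`, `tokenizer`, and `feature_extractor` come from the training step;
# `embed` refers to the IndoCLIP sketch below (all names are assumptions).

def score(image, caption):
    """Cosine similarity between one image and one Indonesian caption."""
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    tokens = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        img_emb, txt_emb = model.embed(pixel_values, tokens.input_ids, tokens.attention_mask)
    # Both embeddings are unit-normalized, so the dot product is cosine similarity.
    return float((img_emb * txt_emb).sum())

gr.Interface(
    fn=score,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Indonesian caption")],
    outputs=gr.Number(label="Image-text similarity"),
).launch()
```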
(Optional) Challenges
This project will require some modifications to the existing models. Specifically, projection layers need to be added on top of both the text and image encoders so that their outputs can be mapped into a shared embedding space where the contrastive loss is computed.
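A minimal PyTorch sketch of that modification is shown below, assuming the example checkpoints from the Model section. The IndoCLIP class name, the `embed` helper, and the projection size are illustrative choices, and the upcoming training script may structure this differently (e.g. in Flax).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, ViTModel

class IndoCLIP(nn.Module):
    """Dual encoder with linear projection heads, in the spirit of CLIP.

    Checkpoint names are examples only; any ViT / Indonesian BERT pair works.
    """

    def __init__(self, proj_dim: int = 512):
        super().__init__()
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = BertModel.from_pretrained("indobenchmark/indobert-base-p1")
        # The projection layers mentioned above: map each encoder's pooled
        # output into a shared embedding space of size proj_dim.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, proj_dim, bias=False)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim, bias=False)
        # Learnable temperature, initialized to ln(1/0.07) as in the CLIP paper.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def embed(self, pixel_values, input_ids, attention_mask=None):
        img = self.image_encoder(pixel_values=pixel_values).pooler_output
        txt = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return F.normalize(self.image_proj(img), dim=-1), F.normalize(self.text_proj(txt), dim=-1)

    def forward(self, pixel_values, input_ids, attention_mask=None):
        img_emb, txt_emb = self.embed(pixel_values, input_ids, attention_mask)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.T
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: match each image to its text and each text to its image.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```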