Building on publicly available Korean language models, we want to train a multimodal generative system. Specifically, we train CLIP on Korean datasets using KoBERT and ViT as backbones.
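For readers new to CLIP, here is a minimal JAX sketch of the symmetric contrastive objective it is trained with. This is an illustration only, not the project's training script: the function name `clip_loss` and the fixed temperature of 0.07 are assumptions (CLIP itself learns the temperature), and the embeddings are assumed to be L2-normalized already.

```python
import jax
import jax.numpy as jnp


def clip_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over image-text similarity logits.

    Row i of image_embeds and row i of text_embeds are assumed to be
    a matching pair; every other row in the batch acts as a negative.
    """
    logits = image_embeds @ text_embeds.T / temperature  # (batch, batch)
    idx = jnp.arange(logits.shape[0])
    # Image-to-text: softmax over the captions for each image.
    log_probs_i2t = jax.nn.log_softmax(logits, axis=1)
    # Text-to-image: softmax over the images for each caption.
    log_probs_t2i = jax.nn.log_softmax(logits, axis=0)
    loss_i2t = -log_probs_i2t[idx, idx].mean()
    loss_t2i = -log_probs_t2i[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


# Toy check: perfectly aligned pairs should score a lower loss
# than the same embeddings paired with the wrong captions.
eye = jnp.eye(4)
loss_aligned = clip_loss(eye, eye)
loss_shuffled = clip_loss(eye, jnp.roll(eye, 1, axis=0))
```

The loss only depends on the two embedding matrices, so it stays the same no matter which image or text backbone produces them.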
ViT, KoBERT (or any other Korean encoder LM)
KETI has released a Korean image captioning dataset, available on AI Hub. We can also utilize the WIT dataset, which is a multilingual dataset scraped from Wikipedia.
The training script is (almost) already available, see here.
KoBERT likely performs slightly worse than the default English BERT. While this could be a bottleneck, we can write the script so that different LMs can be swapped in easily. This way, when a better Korean LM is released, it can be used with minimal changes.
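One way to keep the text encoder swappable is a small registry keyed by name, so the training config only stores a string. The sketch below is hypothetical: `register_text_encoder`, `build_text_encoder`, `TrainConfig`, and the `"dummy-kobert"` placeholder are all made up for illustration, and the dummy encoder stands in for a real KoBERT wrapper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A text encoder maps a batch of strings to one embedding per string.
TextEncoder = Callable[[List[str]], List[List[float]]]

# Hypothetical registry: encoder name -> factory that builds the encoder.
_ENCODER_REGISTRY: Dict[str, Callable[[], TextEncoder]] = {}


def register_text_encoder(name: str):
    """Decorator that registers an encoder factory under `name`."""
    def decorator(factory: Callable[[], TextEncoder]):
        _ENCODER_REGISTRY[name] = factory
        return factory
    return decorator


def build_text_encoder(name: str) -> TextEncoder:
    """Look up and build the encoder selected in the config."""
    if name not in _ENCODER_REGISTRY:
        known = ", ".join(sorted(_ENCODER_REGISTRY))
        raise KeyError(f"Unknown text encoder '{name}'. Known: {known}")
    return _ENCODER_REGISTRY[name]()


@register_text_encoder("dummy-kobert")
def _make_dummy_encoder() -> TextEncoder:
    # Placeholder for a real KoBERT wrapper: returns toy 2-d "embeddings".
    def encode(texts: List[str]) -> List[List[float]]:
        return [[float(len(t)), 1.0] for t in texts]
    return encode


@dataclass
class TrainConfig:
    # Swapping in a new Korean LM means registering it and changing this name.
    text_encoder: str = "dummy-kobert"


encoder = build_text_encoder(TrainConfig().text_encoder)
embeddings = encoder(["안녕", "고양이"])
```

With this layout, adding a newly released Korean LM only requires one new registered factory; the rest of the training loop does not need to change.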
There might be minor architectural adjustments we have to make (e.g. adding projection layers).
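As a rough illustration of what such projection layers could look like, the JAX sketch below maps image and text features into a shared embedding space and L2-normalizes them. The dimensions (768-d backbone outputs, 512-d shared space), the bias-free linear projection, and the helper names `init_projection` and `project` are assumptions for this example; random arrays stand in for the pooled ViT and KoBERT outputs.

```python
import jax
import jax.numpy as jnp


def init_projection(key, in_dim, out_dim):
    """Bias-free linear projection weights with scaled normal init."""
    return jax.random.normal(key, (in_dim, out_dim)) / jnp.sqrt(in_dim)


def project(features, weight):
    """Project features into the shared space and L2-normalize each row."""
    embeds = features @ weight
    return embeds / jnp.linalg.norm(embeds, axis=-1, keepdims=True)


key = jax.random.PRNGKey(0)
k_img, k_txt, k_a, k_b = jax.random.split(key, 4)

# Assumed sizes: 768-d backbone features, 512-d shared embedding space.
image_proj = init_projection(k_img, 768, 512)
text_proj = init_projection(k_txt, 768, 512)

# Random stand-ins for a batch of 4 pooled ViT / KoBERT outputs.
image_features = jax.random.normal(k_a, (4, 768))
text_features = jax.random.normal(k_b, (4, 768))

image_embeds = project(image_features, image_proj)
text_embeds = project(text_features, text_proj)

# Cosine similarity between every image and every caption in the batch.
similarity = image_embeds @ text_embeds.T  # shape (4, 4)
```

Keeping a separate projection per modality means the two backbones can have different output sizes, which matters if the text encoder is later replaced.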
Attaining fluency in JAX will take time and effort.
The final deliverable of this project will most likely be an open-source repository with accompanying documentation and model weights, and potentially a Streamlit demo app.
Hello @jaketae & Team,
I would love to be part of such an amazing project and team. I will do my best to contribute to the Korean version of the CLIP model. It would be nice if we could discuss some learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team.
Hi @jaketae & Team,
I am interested in joining the KoCLIP team. I am Korean and will contribute to this project as much as I can. It would be great if I could participate in this project and discuss it with you. I can work in any time zone that is comfortable for everyone on the team.