KoCLIP: Pretraining CLIP on Korean


Building on publicly available Korean language models, we want to train a multimodal (image-text) model. Specifically, we train CLIP on Korean datasets using KoBERT and ViT as backbones.
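As a rough illustration of the objective CLIP trains with, here is a minimal JAX sketch of the symmetric contrastive loss over a batch of image-caption pairs. The function names and the temperature of 0.07 are assumptions for illustration and are not taken from the training script; in the project, the embeddings would come from ViT and KoBERT.

```python
import jax
import jax.numpy as jnp


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + eps)


def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim); row i of each comes from the same pair.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # Pairwise similarities between every image and every caption in the batch.
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    # Matching pairs sit on the diagonal, so the target for row i is column i.
    diag = jnp.arange(logits.shape[0])
    loss_i2t = -jax.nn.log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -jax.nn.log_softmax(logits, axis=0)[diag, diag].mean()
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (loss_i2t + loss_t2i) / 2
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is what pushes the two encoders into a shared embedding space.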


Model

ViT, KoBERT (or any other Korean encoder LM)


Datasets

KETI has released a Korean image captioning dataset, available on AI Hub. We can also use the WIT dataset, a multilingual image-text dataset scraped from Wikipedia.

Training Script

The training script is (almost) already available; see here.


Challenges

  1. KoBERT likely performs slightly worse than the default English BERT. While this could be a bottleneck, we can write the script so that different LMs are easy to swap in. That way, when a better Korean LM is released, it can be plugged in with minimal changes.
  2. We might have to make minor architectural adjustments (e.g. adding projection layers).
  3. Attaining fluency in JAX will take time and effort.
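To make the first two points concrete, here is a minimal JAX sketch of projection layers that map each encoder's output into a shared embedding space, with the encoder passed in as a plain callable so it can be swapped. Everything here is an assumption for illustration: the helper names, the stand-in encoders that return random features, and the dimensions (768, 1024, 512) are hypothetical and are not the actual KoBERT or ViT interfaces.

```python
import jax
import jax.numpy as jnp


def init_projection(key, in_dim, out_dim):
    # A single bias-free linear map from encoder features to the shared space.
    scale = 1.0 / jnp.sqrt(in_dim)
    return jax.random.normal(key, (in_dim, out_dim)) * scale


def embed(encoder_fn, projection, inputs):
    # encoder_fn: any callable returning pooled features of shape (batch, in_dim).
    features = encoder_fn(inputs)
    projected = features @ projection
    # Unit-normalize so embeddings from both towers are directly comparable.
    return projected / jnp.linalg.norm(projected, axis=-1, keepdims=True)


# Hypothetical stand-ins: random features in place of real pretrained encoders.
def fake_text_encoder(token_ids):
    return jax.random.normal(jax.random.PRNGKey(1), (token_ids.shape[0], 768))


def fake_image_encoder(pixels):
    return jax.random.normal(jax.random.PRNGKey(2), (pixels.shape[0], 1024))


key_t, key_i = jax.random.split(jax.random.PRNGKey(0))
text_proj = init_projection(key_t, 768, 512)
image_proj = init_projection(key_i, 1024, 512)

text_emb = embed(fake_text_encoder, text_proj, jnp.zeros((4, 16), dtype=jnp.int32))
image_emb = embed(fake_image_encoder, image_proj, jnp.zeros((4, 3, 224, 224)))
```

Because `embed` only assumes that the encoder returns a `(batch, in_dim)` array, switching to a different Korean LM would mean passing a different callable and resizing the projection, with the rest of the pipeline unchanged.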

Desired Outcomes

The final deliverable of this project will most likely be an open source repository, accompanying documentation, model weights, and potentially a demo Streamlit app.


Awesome idea! I’m also very interested in Korean NLP :)
Count me in!

Hello, I’d like to join this project! I have experience with Transformers, BERT and PyTorch. I’m interested in multi-modal learning and Korean NLP as well!

Great! Let me join this fun project!

Thanks everyone! Please feel free to share any thoughts you have regarding the direction or details of this project. Looking forward to the next couple of weeks.

Just so that we can set up meetings a bit more easily, I think it would be good if we each shared which time zone we are in. I’m in GMT+9!

Agreed. I mean I don’t think timezone should ever prevent anyone from joining, but for the purposes of arranging logistics, it would certainly be helpful. I’m also on KST/GMT+9.

I’m on PST / GMT-7. We can also use this tool to set up the first meeting: https://www.when2meet.com/

Hello @jaketae & Team,
I would love to be a part of such an amazing project and team. I will try my best to contribute to the Korean version of the CLIP model. It would be nice if we could discuss some learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team.

Great! Let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

IMHO aligning Korean encoders with the pre-trained CLIP text encoder will probably suffice. It would be great if we could do better. Count me in :relaxed: (I’m on GMT+9)
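One possible reading of this suggestion, sketched in JAX: keep the pre-trained CLIP text encoder frozen as a teacher and train the Korean side to reproduce its embeddings on parallel sentences with a mean-squared-error loss. This is an assumption about what "aligning" would mean here, not a method confirmed in this thread. The toy data, the helper names, and the choice to train only a projection on top of fixed student features are all hypothetical simplifications.

```python
import jax
import jax.numpy as jnp


def alignment_loss(projection, student_features, teacher_emb):
    # student_features: (batch, student_dim), e.g. Korean encoder outputs.
    # teacher_emb: (batch, clip_dim), e.g. frozen CLIP text embeddings of
    # the corresponding translated sentences (assumed to be available).
    pred = student_features @ projection
    return jnp.mean((pred - teacher_emb) ** 2)


def sgd_step(projection, student_features, teacher_emb, lr=0.5):
    # One plain gradient-descent update on the projection matrix only.
    loss, grad = jax.value_and_grad(alignment_loss)(
        projection, student_features, teacher_emb
    )
    return projection - lr * grad, loss


# Toy data: the teacher embeddings are an exact linear function of the
# student features, so the projection can in principle fit them.
key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
student_features = jax.random.normal(key_x, (32, 16))
teacher_emb = student_features @ jax.random.normal(key_w, (16, 8))

projection = jnp.zeros((16, 8))
losses = []
for _ in range(100):
    projection, loss = sgd_step(projection, student_features, teacher_emb)
    losses.append(float(loss))
```

In a real setup the Korean encoder itself would presumably be fine-tuned as well, and this route needs no images at all, only sentence pairs, which is why it may be cheaper than training CLIP end to end.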

Added you to the team :slight_smile:


@junhsss @devtrent @srisweet @tree-park Hi! I have opened up a Discord channel at Flax-HuggingFace-Community-Week for our project. It would be great if all of us could join and start sharing ideas.


Hi @jaketae & Team,
I would love to be a part of the KoCLIP team. I am Korean and will contribute to this project as much as I can. It would be nice if I could participate in this project and discuss it with you. I can work in any time zone that is comfortable for everyone on the team.

Hey all! This seems like a mighty interesting project to partake in :smiley:
I’m a Korean data engineer with prior NLP experience.
Please count me in if there are any spots left!

Hi @amphora, @kyungeun, added you both to the team :slight_smile: