Use custom text encoder in CLIP

Hi guys!
I want to train a CLIP model for Ukrainian and would like to replace the text encoder with a pretrained Ukrainian text encoder. I found a couple of discussions about CLIP implemented in Flax for Spanish and Korean, and research papers about replacing the CLIP text encoder for other languages (AltCLIP), but they all more or less reimplement CLIP to do it.

Is there an easy way to initialize a CLIPModel with a custom text encoder (available on the HF Hub)? Same question for replacing the tokenizer in the CLIPProcessor.

I would like something like this:

model = CLIPModel(text_encoder=my_text_encoder, image_encoder=my_image_encoder)
processor = CLIPProcessor(tokenizer=my_tokenizer, image_processor=my_image_processor)

Hi, the solution I found is to use the VisionTextDualEncoderModel class of the transformers library rather than the CLIPModel class, since it lets you customize the vision and text encoders: "This model can be used to align the vision-text embeddings using CLIP-like contrastive image-text training and then can be used for zero-shot vision tasks such as image classification or retrieval."
Not sure all text encoder architectures are compatible, but worth a try!
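Something like this should work (an untested sketch; the checkpoint names are just examples, swap in whatever Ukrainian text encoder you want to use):

```python
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    AutoTokenizer,
    AutoImageProcessor,
)

# Example checkpoints (assumptions): any CLIP vision tower plus any
# text encoder from the Hub should be loadable here.
vision_ckpt = "openai/clip-vit-base-patch32"
text_ckpt = "youscan/ukr-roberta-base"  # hypothetical Ukrainian encoder choice

# Initialize the dual encoder from the two pretrained backbones;
# the projection layers on top are newly initialized and need training.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(vision_ckpt, text_ckpt)

# Build a matching processor from the text tokenizer and the image processor.
tokenizer = AutoTokenizer.from_pretrained(text_ckpt)
image_processor = AutoImageProcessor.from_pretrained(vision_ckpt)
processor = VisionTextDualEncoderProcessor(image_processor=image_processor, tokenizer=tokenizer)

# Save so both can later be reloaded with from_pretrained.
model.save_pretrained("clip-ukrainian")
processor.save_pretrained("clip-ukrainian")
```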


Thank you so much! This looks like exactly what I need! One thing I didn't mention is that it's actually for a SigLIP model, and as far as I can see, VisionTextDualEncoderModel only supports the CLIP loss for now. I will create a PR to make it work with the SigLIP loss as well 🙂
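For anyone who lands here, the difference boils down to the loss function. A rough sketch of both (not the exact transformers code; `logits_per_text` is assumed to be the scaled text-image similarity matrix for a batch):

```python
import torch
import torch.nn.functional as F

def clip_loss(logits_per_text: torch.Tensor) -> torch.Tensor:
    # CLIP contrastive loss: symmetric cross-entropy over the
    # similarity matrix, with the diagonal as the correct class.
    labels = torch.arange(logits_per_text.size(0), device=logits_per_text.device)
    loss_text = F.cross_entropy(logits_per_text, labels)
    loss_image = F.cross_entropy(logits_per_text.t(), labels)
    return (loss_text + loss_image) / 2

def siglip_loss(logits_per_text: torch.Tensor) -> torch.Tensor:
    # SigLIP sigmoid loss: each image-text pair is an independent
    # binary classification; diagonal pairs are positives (+1),
    # all off-diagonal pairs are negatives (-1).
    n = logits_per_text.size(0)
    targets = 2 * torch.eye(n, device=logits_per_text.device) - 1
    return -F.logsigmoid(targets * logits_per_text).sum() / n
```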
