Hi guys!
I want to train a CLIP model for Ukrainian and would like to replace the text encoder with a pretrained Ukrainian text encoder. I found a couple of discussions about CLIP implemented in Flax for Spanish and Korean, and research papers about replacing the CLIP text encoder for other languages (AltCLIP), but they all more or less reimplement CLIP to do it.
Is there an easy way to initialize a CLIPModel with a custom text encoder (available on the HF Hub)? Same question for replacing the tokenizer in the CLIPProcessor.
I would like something like this:
model = CLIPModel(text_encoder=my_text_encoder, image_encoder=my_image_encoder)
processor = CLIPProcessor(tokenizer=my_tokenizer, image_processor=my_image_processor)
Hi, the solution I found is to use the VisionTextDualEncoderModel class rather than the CLIPModel class from the transformers library, since it lets you pick the vision and text encoders yourself. From the docs: "This model can be used to align the vision-text embeddings using CLIP-like contrastive image-text training and then can be used for zero-shot vision tasks such as image classification or retrieval."
I'm not sure all text encoder architectures are compatible, but it's worth a try!
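In case it helps, a minimal sketch of the wiring (the Ukrainian checkpoint name below is just a placeholder; swap in whatever text encoder you want):

```python
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Example checkpoints; the text encoder can be any BERT/RoBERTa-style
# model from the Hub.
vision_model_id = "openai/clip-vit-base-patch32"
text_model_id = "youscan/ukr-roberta-base"

# Builds a dual encoder from the two pretrained towers. Note that the
# projection layers on top are freshly initialized, so you still need
# contrastive training to align the embeddings.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    vision_model_id, text_model_id
)

# The processor pairs the vision tower's image processor with the text
# tower's tokenizer, mirroring what CLIPProcessor does.
processor = VisionTextDualEncoderProcessor(
    image_processor=AutoImageProcessor.from_pretrained(vision_model_id),
    tokenizer=AutoTokenizer.from_pretrained(text_model_id),
)
```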
Thank you so much! This looks like exactly what I need! One thing I didn't mention is that it's actually for a SigLIP model, and as far as I can see, VisionTextDualEncoder only supports the CLIP loss for now. I will create a PR to make it work with the SigLIP loss as well.
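For anyone landing here later, the sigmoid loss is small enough to sketch. This is just the pairwise loss from the SigLIP paper (Zhai et al., 2023) written out in PyTorch, not an existing transformers API; the function name and argument names are made up:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds: torch.Tensor,
                text_embeds: torch.Tensor,
                logit_scale: torch.Tensor,
                logit_bias: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid loss from SigLIP: every image-text pair is an
    independent binary classification, so unlike the CLIP loss there is
    no softmax over the whole batch."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() * logit_scale + logit_bias
    # +1 on the diagonal (positive pairs), -1 off-diagonal (negatives).
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```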