Converting CLIPModel to VisionTextDualEncoderModel

Hi HF team, and thanks for your amazing work! For my research I would like to use the CLIPModel “openai/clip-vit-base-patch32” through the VisionTextDualEncoderModel class. I’ve tried the from_pretrained() method, but it doesn’t support the clip_text_model class of the text tower.
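For reference, here is roughly what I attempted, using the from_vision_text_pretrained() method documented for this class. The comments reflect my understanding of where it breaks, which may well be wrong:

```python
from transformers import VisionTextDualEncoderModel

# Attempt: reuse the same CLIP checkpoint for both towers.
# My understanding is that the vision tower is handled (CLIPVisionModel
# is supported explicitly), but the text side does not accept the
# clip_text_model config of the CLIP text tower, so this fails.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",
    "openai/clip-vit-base-patch32",
)
```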

Do you have any suggestions on where to start with writing such a conversion script?
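My rough idea so far is to instantiate the two towers separately and copy CLIP’s trained projection heads over. Please treat this as a sketch, not a working solution: the constructor and attribute names below are taken from the docs, but whether the projections and pooled outputs actually line up is my assumption.

```python
import torch
from transformers import (
    CLIPModel,
    CLIPTextModel,
    CLIPVisionModel,
    VisionTextDualEncoderModel,
)

name = "openai/clip-vit-base-patch32"

# Load the two towers directly from the CLIP checkpoint.
vision_model = CLIPVisionModel.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

# Build the dual encoder around the pre-instantiated towers.
dual_encoder = VisionTextDualEncoderModel(
    vision_model=vision_model,
    text_model=text_model,
)

# VisionTextDualEncoderModel initializes fresh projection layers,
# so copy CLIP's trained projections and temperature over.
# Assumption on my part: the shapes match (projection_dim=512 for
# this checkpoint) and the pooled outputs are comparable.
with torch.no_grad():
    clip = CLIPModel.from_pretrained(name)
    dual_encoder.visual_projection.load_state_dict(clip.visual_projection.state_dict())
    dual_encoder.text_projection.load_state_dict(clip.text_projection.state_dict())
    dual_encoder.logit_scale.copy_(clip.logit_scale)
```

I haven’t verified that the forward pass works end to end with a CLIPTextModel as the text tower, which is part of why I’m asking.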

cc @valhalla, who might know about this since he worked on adding both CLIP and VisionTextDualEncoder to the Transformers library.