Hi HF team, and thanks for your amazing work. For my research I would like to use the CLIPModel "openai/clip-vit-base-patch32" through the VisionTextDualEncoderModel class. I've tried the from_pretrained()
method, but it does not support the clip_text_model
class of the text tower.
Do you have any suggestions on where to start writing such a script?