I saw that the last hidden state of the CLIP text features is passed to Stable Diffusion; it has shape [B, 77, 1024]. I am instantiating the text-related components of CLIP from the stable-diffusion checkpoints. My problem is that the last hidden state of the CLIP image features has shape [B, 257, 1280]. I have to instantiate the CLIPVisionModel from another repo, because stabilityai/stable-diffusion-2-1-base does not include one.
This is my code:
from PIL import Image
from transformers import AutoTokenizer, CLIPTextModel, CLIPImageProcessor, CLIPVisionModel

model_id = "stabilityai/stable-diffusion-2-1-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
image_processor = CLIPImageProcessor.from_pretrained(model_id, subfolder="feature_extractor")
image_encoder = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
# text
text_inputs = tokenizer(["a photo of a cat"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt")
text_features = text_encoder(**text_inputs) # [1, 77, 1024]
print(text_features.last_hidden_state.shape)
# image
image = Image.open("cat.png").convert("RGB")  # placeholder path; any RGB image works
image_inputs = image_processor(images=image, return_tensors="pt")
image_features = image_encoder(**image_inputs) # [1, 257, 1280]
print(image_features.last_hidden_state.shape)
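For context, this is roughly how I would normally hand the text features to the UNet (a minimal sketch; the latents and timestep below are just dummy values I made up to show the shapes):

import torch
from diffusers import UNet2DConditionModel

# The UNet's cross-attention expects encoder_hidden_states of shape [B, seq_len, 1024] for SD 2.1.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
latents = torch.randn(1, 4, 64, 64)   # dummy latent
timestep = torch.tensor([10])         # dummy timestep
noise_pred = unet(latents, timestep, encoder_hidden_states=text_features.last_hidden_state).sample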
- Is it even possible to replace the text embeddings with image embeddings directly?
- If yes, then how can I get the same shapes?
- I think that the vision- and text-encoder outputs are passed through a projection layer before the loss computation during training. Would I need to map the image_features to the projection space and from there to the text embedding space using the linear projections? (See the sketch below for what I have in mind.)
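To make the last point concrete, this is the kind of thing I have in mind (untested sketch; I am assuming the CLIPVisionModelWithProjection class exposes the same visual projection that was used during CLIP training):

from transformers import CLIPVisionModelWithProjection

# Untested sketch: get the image features already projected into the shared CLIP embedding space.
image_encoder_proj = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
proj_outputs = image_encoder_proj(**image_inputs)
image_embeds = proj_outputs.image_embeds  # [1, 1024], one pooled vector per image
# Even then this is a single vector per image, not a [1, 77, 1024] sequence,
# so it still would not match the shape of text_encoder's last_hidden_state.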