How to condition Stable-Diffusion on CLIP image embeddings?

I saw that the last hidden state of the CLIP text encoder is what gets passed to Stable Diffusion as conditioning. It has shape [B, 77, 1024]. I am instantiating the text-related CLIP components from the Stable Diffusion checkpoint.

My problem is that the last hidden state of the CLIP image encoder has shape [B, 257, 1280], so neither the sequence length nor the feature width matches. I also need to instantiate CLIPVisionModel from another repo, because stabilityai/stable-diffusion-2-1-base does not include one.
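For reference, the 257 image tokens follow from the ViT patch grid plus one class token. A quick check of that arithmetic, assuming the 224x224 input resolution and the 14-pixel patches implied by the "ViT-H-14" name:

```python
# Assumed values: 224x224 input images and 14x14 patches ("ViT-H-14").
image_size = 224
patch_size = 14

patches_per_side = image_size // patch_size  # patches along one edge of the image
num_patches = patches_per_side ** 2          # patches in the full grid
num_image_tokens = num_patches + 1           # one extra class token is prepended

print(num_image_tokens)
```

The 77 on the text side is simply the tokenizer's maximum sequence length, so the two sequence lengths are unrelated by construction.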

This is my code:

from transformers import AutoTokenizer, CLIPImageProcessor, CLIPTextModel, CLIPVisionModel

model_id = "stabilityai/stable-diffusion-2-1-base"

# Text components come from the Stable Diffusion checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# The vision encoder is not part of the checkpoint, so it is loaded from a separate repo.
image_processor = CLIPImageProcessor.from_pretrained(model_id, subfolder="feature_extractor")
image_encoder = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

# text
text_inputs = tokenizer(["a photo of a cat"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt")
text_features = text_encoder(**text_inputs).last_hidden_state   # [1, 77, 1024]

# image (`image` is a PIL image loaded elsewhere)
image_inputs = image_processor(images=image, return_tensors="pt")
image_features = image_encoder(**image_inputs).last_hidden_state    # [1, 257, 1280]

  • Is it even possible to replace the text embeddings with image embeddings directly?
  • If yes, how can I get matching shapes?
  • I think that the vision and text encoder outputs are each passed through a projection layer before the contrastive loss is computed during training. Would I need to map image_features into the projection space, and from there into the text embedding space, using those linear projections?
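
To make the last question concrete, here is a rough sketch of the shape bookkeeping only, using random tensors in place of the real encoder outputs. The `visual_projection` layer below is a randomly initialised stand-in that I made up for illustration, not the pretrained CLIP weights, and taking the token at index 0 as the pooled feature is also an assumption. It says nothing about whether the U-Net would produce sensible images from such embeddings.

```python
import torch
import torch.nn as nn

B = 1
text_features = torch.randn(B, 77, 1024)    # stand-in for the text encoder's last hidden state
image_features = torch.randn(B, 257, 1280)  # stand-in for the vision encoder's last hidden state

# Hypothetical stand-in for an image projection layer (1280 -> 1024, no bias).
visual_projection = nn.Linear(1280, 1024, bias=False)

with torch.no_grad():
    # Option A: project only the class token, giving one conditioning token per image.
    pooled = image_features[:, 0, :]              # [B, 1280]
    image_embeds = visual_projection(pooled)      # [B, 1024]
    cond_single = image_embeds.unsqueeze(1)       # [B, 1, 1024]

    # Option B: project every token, keeping the sequence length of 257.
    cond_all = visual_projection(image_features)  # [B, 257, 1024]

print(cond_single.shape, cond_all.shape)
```

Both variants end up with the same 1024-wide last dimension as text_features. The sequence length still differs from 77, but as far as I understand cross-attention attends over a variable-length context, so the width seems to be the part that has to match.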