I saw that the last hidden state of the CLIP text features is passed to Stable Diffusion; it has shape [B, 77, 1024]. I am instantiating the text-related components of CLIP from the stable-diffusion checkpoints. My problem is that the last hidden state of the CLIP image features has shape [B, 257, 1280]. I have to instantiate the CLIPVisionModel from another repo, because stabilityai/stable-diffusion-2-1-base does not include one.
This is my code:
from PIL import Image
from transformers import AutoTokenizer, CLIPTextModel, CLIPImageProcessor, CLIPVisionModel

model_id = "stabilityai/stable-diffusion-2-1-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
image_processor = CLIPImageProcessor.from_pretrained(model_id, subfolder="feature_extractor")
image_encoder = CLIPVisionModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
# text
text_inputs = tokenizer(["a photo of a cat"], max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt")
text_features = text_encoder(**text_inputs) # [1, 77, 1024]
print(text_features.last_hidden_state.shape)
# image
image = Image.open("cat.png").convert("RGB")  # placeholder path; any RGB image works
image_inputs = image_processor(images=image, return_tensors="pt")
image_features = image_encoder(**image_inputs) # [1, 257, 1280]
print(image_features.last_hidden_state.shape)
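For context, this is roughly how I would normally hand the text features to the UNet (a minimal sketch; the latents and timestep below are just dummy values I made up to show the shapes):

import torch
from diffusers import UNet2DConditionModel

# The UNet's cross-attention expects encoder_hidden_states of shape [B, seq_len, 1024] for SD 2.1.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
latents = torch.randn(1, 4, 64, 64)   # dummy latent
timestep = torch.tensor([10])         # dummy timestep
noise_pred = unet(latents, timestep, encoder_hidden_states=text_features.last_hidden_state).sample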
- Is it even possible to replace the text embeddings with image embeddings directly?
- If yes, then how can I get the same shapes?
- I think that the vision- and text-encoder outputs are passed through a projection layer before the loss computation during training. Would I need to map the image_features to the projection space and from there to the text embedding space using the linear projections? (See the sketch below for what I have in mind.)
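To make the last point concrete, this is the kind of thing I have in mind (untested sketch; I am assuming the CLIPVisionModelWithProjection class exposes the same visual projection that was used during CLIP training):

from transformers import CLIPVisionModelWithProjection

# Untested sketch: get the image features already projected into the shared CLIP embedding space.
image_encoder_proj = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
proj_outputs = image_encoder_proj(**image_inputs)
image_embeds = proj_outputs.image_embeds  # [1, 1024], one pooled vector per image
# Even then this is a single vector per image, not a [1, 77, 1024] sequence,
# so it still would not match the shape of text_encoder's last_hidden_state.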