Access CLIP from StableDiffusionPipeline and use the same models for multiple pipelines

Hi, I’m using both the StableDiffusionPipeline and the StableDiffusionImg2ImgPipeline, and I would also like to evaluate the generated images against a few prompts with CLIP similarity. If I understand it correctly, both the CLIP image and text encoders are used by the img2img pipeline.

Can I use the functionality of all three without having to load three separate sets of models onto the GPU, i.e. by loading only the StableDiffusionImg2ImgPipeline, which should contain all of the components?

My code for StableDiffusionImg2ImgPipeline:

import torch
from PIL import Image

from diffusers import StableDiffusionImg2ImgPipeline

# load the pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16", 
    torch_dtype=torch.float16,
    use_auth_token=True
)
pipe = pipe.to("cuda")

init_image = Image.open("image.png").convert("RGB")

prompt = "A fantasy landscape, trending on artstation"

images = pipe(prompt=prompt, init_image=init_image, strength=0.75, guidance_scale=7.5).images

images[0].save("fantasy_landscape.png")

Hi @Cajoek!

Yes, you can reuse the components of one pipeline to create another. Continuing from your code, you can do something like this to create a text-to-image pipeline:

from diffusers import StableDiffusionPipeline

t2i = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,  # same gated checkpoint as above
    unet=pipe.unet,
    vae=pipe.vae,
    text_encoder=pipe.text_encoder,
    scheduler=pipe.scheduler,
    tokenizer=pipe.tokenizer,
)

This way you’ll share the same modules between the two pipelines.
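
Once it’s created, t2i can be used like any other text-to-image pipeline. A minimal usage sketch (the prompt and filename are just examples): since the safety checker and feature extractor aren’t passed in above, from_pretrained still loads those two on the CPU, so I’d move the new pipeline to the GPU before calling it.

t2i = t2i.to("cuda")  # shared modules are already on the GPU; this only moves the newly loaded ones

prompt = "a photo of an astronaut riding a horse"
image = t2i(prompt, guidance_scale=7.5).images[0]
image.save("astronaut.png")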

Hope that helps!

Awesome thanks!

Regarding CLIP similarity: I figured out that I can access the CLIPTextModel via pipe.text_encoder and get the text embeddings with:

text_inputs = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to("cuda")

with torch.no_grad():
    text_embeddings = pipe.text_encoder(text_input_ids).pooler_output
print(text_embeddings.shape)  # [1, 768]

I can also access the CLIPVisionModel via pipe.safety_checker.vision_model and get the image embeddings with:

vision_input = pipe.feature_extractor(image, return_tensors="pt").pixel_values.to("cuda", dtype=torch.float16)
with torch.no_grad():
    image_embeddings = pipe.safety_checker.vision_model(vision_input).pooler_output
print(image_embeddings.shape)  # [1, 1024]

Unfortunately it seems like different CLIP encoders are used for the text and the image (hidden size 768 vs. 1024). I therefore tried creating a separate CLIPVisionModel to match pipe.text_encoder, using the CLIP version stated in the docs, "openai/clip-vit-large-patch14":

from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

image = Image.open("image.png").convert("RGB")

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

inputs = processor(images=image, return_tensors="pt")
image_embeddings = model(**inputs).pooler_output

print(image_embeddings.shape)  # [1, 1024]

However, this also outputs a vector of size 1024, not 768 as expected. What am I doing wrong, @pcuenq?

(I also get a lot of "weights text_model.encoder.layers... were not used" warnings when creating the CLIPVisionModel, which seems strange.)


I think the main reason is that CLIPModel internally has linear projections (text_projection and visual_projection) that map the text and image embeddings to the same shape.
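
As far as I know, the "weights ... were not used" warning is also expected: CLIPVisionModel only loads the vision tower of the full CLIP checkpoint, so the text weights are discarded. If the goal is CLIP similarity, a rough sketch would be to use CLIPModel, whose get_text_features / get_image_features return the projected embeddings (768-dimensional for this checkpoint). The prompt and filename below are just examples, and note that this loads a separate copy of the CLIP weights rather than reusing the pipeline components:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("fantasy_landscape.png").convert("RGB")
prompt = "A fantasy landscape, trending on artstation"

inputs = clip_processor(text=[prompt], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # projected embeddings, both of shape [1, 768] for this checkpoint
    text_emb = clip.get_text_features(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
    image_emb = clip.get_image_features(pixel_values=inputs.pixel_values)

# CLIP similarity = cosine similarity of the L2-normalized projected embeddings
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print((text_emb * image_emb).sum(dim=-1))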