Access CLIP from StableDiffusionPipeline and use the same models for multiple pipelines

Cajoek · October 22, 2022, 3:16pm

Hi, I’m using both the StableDiffusionPipeline and the StableDiffusionImg2ImgPipeline, and I would also like to score the generated images against a few prompts using CLIP similarity. If I understand correctly, both the CLIP image encoder and the CLIP text encoder are used by the img2img pipeline.

Can I use all three (text-to-image, img2img, and CLIP scoring) without loading three separate sets of models onto the GPU, i.e. by loading only the StableDiffusionImg2ImgPipeline, which should contain all the components?

My code for StableDiffusionImg2ImgPipeline:

import torch
from PIL import Image

from diffusers import StableDiffusionImg2ImgPipeline

# load the pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16", 
    torch_dtype=torch.float16,
    use_auth_token=True
)
pipe = pipe.to("cuda")

init_image = Image.open("image.png").convert("RGB")

prompt = "A fantasy landscape, trending on artstation"

images = pipe(prompt=prompt, init_image=init_image, strength=0.75, guidance_scale=7.5).images

images[0].save("fantasy_landscape.png")

pcuenq · October 23, 2022, 11:03am

Hi @Cajoek!

Yes, you can reuse the components from a pipeline to create another one. Continuing from your code, you can do something like this to create a text to image pipeline:

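from diffusers import StableDiffusionPipeline

# Reuse the modules already loaded for img2img instead of loading new copies
t2i = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    unet=pipe.unet,
    vae=pipe.vae,
    text_encoder=pipe.text_encoder,
    scheduler=pipe.scheduler,
    tokenizer=pipe.tokenizer,
    safety_checker=pipe.safety_checker,
    feature_extractor=pipe.feature_extractor,
)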

This way you’ll share those modules between both pipelines.
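As a rough illustration of why this does not cost extra memory: a pipeline object only holds references to its modules, so two pipelines built from the same modules point at the same weights. The sketch below uses small `nn.Linear` layers as stand-ins for the real models, and `TinyPipeline` is a hypothetical class made up for this example, not part of diffusers.

```python
from torch import nn


class TinyPipeline:
    """Hypothetical stand-in for a pipeline: it only stores module references."""

    def __init__(self, unet: nn.Module, text_encoder: nn.Module):
        self.unet = unet
        self.text_encoder = text_encoder


# Small stand-ins for the real (large) models.
unet = nn.Linear(8, 8)
text_encoder = nn.Linear(8, 8)

img2img = TinyPipeline(unet=unet, text_encoder=text_encoder)
# Build the second "pipeline" from the modules of the first one.
text2img = TinyPipeline(unet=img2img.unet, text_encoder=img2img.text_encoder)

# Both pipelines hold the same module object, backed by the same weight storage.
same_object = text2img.unet is img2img.unet
same_storage = text2img.unet.weight.data_ptr() == img2img.unet.weight.data_ptr()
print(same_object, same_storage)
```

The same check (`t2i.unet is pipe.unet`) should work on the real pipelines if you want to confirm nothing was loaded twice.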

Hope that helps!

Cajoek · October 23, 2022, 2:01pm

Awesome thanks!

Regarding CLIP similarity, I figured out that I can access the CLIPTextModel with pipe.text_encoder and I can get the text embeddings with:

text_inputs = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
text_input_ids = text_input_ids.to("cuda")

text_embeddings = pipe.text_encoder(text_input_ids).pooler_output
print(text_embeddings.shape)  #  [1, 768]

I can also access the CLIPVisionModel with pipe.safety_checker.vision_model and get the image embeddings with:

# Move the pixel values to the same device and dtype as the fp16 model on the GPU
vision_input = pipe.feature_extractor(image, return_tensors="pt").pixel_values.to("cuda", torch.float16)
image_embeddings = pipe.safety_checker.vision_model(vision_input).pooler_output
print(image_embeddings.shape)  # [1, 1024]

Unfortunately it seems like different CLIP encoders are used for text and image (hidden size 768 vs 1024). I therefore tried creating a separate CLIPVisionModel to match pipe.text_encoder with the CLIP version stated in the docs: "openai/clip-vit-large-patch14":

from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

image = Image.open("image.png").convert("RGB")

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

inputs = processor(images=image, return_tensors="pt")
image_embeddings = model(**inputs).pooler_output

print(image_embeddings.shape)  # [1, 1024]

However, this also outputs a vector of size 1024, not 768 as expected. What am I doing wrong, @pcuenq?

(I also get a lot of "weights text_model.encoder.layers... not used" warnings when creating the CLIPVisionModel, which seems strange.)

cortwave · October 11, 2023, 10:50pm

I think the main reason is that CLIPModel contains linear projections for the text and image embeddings, which project them to the same dimension:

github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/clip/modeling_clip.py#L995

    text_config = config.text_config
    vision_config = config.vision_config

    self.projection_dim = config.projection_dim
    self.text_embed_dim = text_config.hidden_size
    self.vision_embed_dim = vision_config.hidden_size

    self.text_model = CLIPTextTransformer(text_config)
    self.vision_model = CLIPVisionTransformer(vision_config)

    self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
    self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
    self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value))

    # Initialize weights and apply final processing
    self.post_init()

@add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
def get_text_features(
    self,
    input_ids: Optional[torch.Tensor] = None,
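
To make that concrete, here is a minimal sketch of the two-step path (encoder pooler_output, then projection). It builds a tiny, randomly initialised CLIPModel so it runs without downloading anything. All sizes and token ids are made up for the example. In openai/clip-vit-large-patch14 the text hidden size is 768, the vision hidden size is 1024, and projection_dim is 768, which is why CLIPVisionModel alone returns 1024-dimensional vectors.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPConfig, CLIPModel, CLIPTextConfig, CLIPVisionConfig

# Tiny random CLIP: sizes are illustrative only, not those of a real checkpoint.
text_config = CLIPTextConfig(
    vocab_size=100, hidden_size=32, intermediate_size=64,
    num_hidden_layers=2, num_attention_heads=4, max_position_embeddings=16,
    pad_token_id=1, bos_token_id=98, eos_token_id=99,
)
vision_config = CLIPVisionConfig(
    hidden_size=48, intermediate_size=96, num_hidden_layers=2,
    num_attention_heads=4, image_size=32, patch_size=8,
)
config = CLIPConfig(
    text_config=text_config.to_dict(),
    vision_config=vision_config.to_dict(),
    projection_dim=16,
)
model = CLIPModel(config).eval()

input_ids = torch.tensor([[98, 5, 6, 7, 99]])  # fake token ids, ending in EOS
pixel_values = torch.randn(1, 3, 32, 32)       # fake preprocessed image

with torch.no_grad():
    # Step 1: encoder outputs have different widths (32 for text, 48 for image here).
    text_pooled = model.text_model(input_ids=input_ids).pooler_output
    image_pooled = model.vision_model(pixel_values=pixel_values).pooler_output
    # Step 2: the projections map both into the shared projection_dim (16 here).
    text_emb = model.text_projection(text_pooled)
    image_emb = model.visual_projection(image_pooled)

# Cosine similarity is only meaningful after projection into the shared space.
similarity = F.cosine_similarity(text_emb, image_emb).item()
print(text_pooled.shape, image_pooled.shape)
print(text_emb.shape, image_emb.shape)
```

With the real checkpoint, loading CLIPModel (or CLIPTextModelWithProjection and CLIPVisionModelWithProjection) from "openai/clip-vit-large-patch14" should give embeddings of matching size. The "weights ... not used" messages when loading CLIPVisionModel are expected, since that checkpoint also contains the text tower and the projection layers, which CLIPVisionModel does not use.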
