Hi, I’m using both the StableDiffusionPipeline and the StableDiffusionImg2ImgPipeline, and I would also like to evaluate the generated images against a few prompts with CLIP similarity. If I understand it correctly, both the CLIP image and text encoders are used by the img2img pipeline.
Can I use the functionality of all three pipelines without having to load three separate pipelines onto the GPU, i.e. by loading only the StableDiffusionImg2ImgPipeline, which should contain all the components?
My code for StableDiffusionImg2ImgPipeline:
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline
# load the pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
revision="fp16",
torch_dtype=torch.float16,
use_auth_token=True
)
pipe = pipe.to("cuda")
init_image = Image.open("image.png").convert("RGB")
prompt = "A fantasy landscape, trending on artstation"
images = pipe(prompt=prompt, init_image=init_image, strength=0.75, guidance_scale=7.5).images
images[0].save("fantasy_landscape.png")
Yes, you can reuse the components from a pipeline to create another one. Continuing from your code, you can do something like this to create a text-to-image pipeline:
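(A sketch: the keyword arguments below mirror the StableDiffusionPipeline constructor, so the exact signature may differ slightly depending on your diffusers version.)

from diffusers import StableDiffusionPipeline

# build a text-to-image pipeline that reuses the modules already loaded on the GPU
text2img = StableDiffusionPipeline(
    vae=pipe.vae,
    text_encoder=pipe.text_encoder,
    tokenizer=pipe.tokenizer,
    unet=pipe.unet,
    scheduler=pipe.scheduler,
    safety_checker=pipe.safety_checker,
    feature_extractor=pipe.feature_extractor,
)

image = text2img("A fantasy landscape, trending on artstation").images[0]

Since the modules are shared rather than copied, this does not allocate any additional GPU memory.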
Unfortunately, it seems like different CLIP encoders are used for text and images (hidden size 768 vs. 1024). I therefore tried creating a separate CLIPVisionModel to pair with pipe.text_encoder, using the CLIP version stated in the docs, "openai/clip-vit-large-patch14":
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel
image = Image.open("image.png").convert("RGB")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, return_tensors="pt")
image_embeddings = model(**inputs).pooler_output
print(image_embeddings.shape) # [1, 1024]
However, this also outputs a vector of size 1024, not 768 as I expected. What am I doing wrong @pcuenq?
(I also get a lot of "weights text_model.encoder.layers... not used" warnings when creating the CLIPVisionModel, which seems strange.)
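In case it clarifies what I’m after, here is roughly the CLIP similarity evaluation I had in mind. It’s only a sketch: it loads a separate full CLIPModel (whose projection heads map both towers into a shared 768-dimensional space) instead of reusing pipe.text_encoder, which is exactly the duplication I was hoping to avoid, and the second prompt is just a placeholder example.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("fantasy_landscape.png").convert("RGB")
prompts = ["A fantasy landscape, trending on artstation", "A photo of a cat"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # projected features in the shared embedding space
    image_features = clip.get_image_features(pixel_values=inputs["pixel_values"])  # [1, 768]
    text_features = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )  # [2, 768]

# cosine similarity between the image and each prompt
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)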