CLIP model incorporated in CLIPSeg

vivien · February 22, 2023, 12:22am

Hi. I understand that CLIPSeg incorporates a frozen openai/clip-vit-base-patch16 model. However, I don’t get the same results when I extract image features with the CLIP model inside CIDAS/clipseg-rd64-refined as I do with openai/clip-vit-base-patch16 itself.

More precisely, if I extract text features with both models:

from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

model2 = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to(DEVICE)
processor2 = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

inputs = processor(text=["test"], return_tensors="pt", padding=True)
result = model.get_text_features(**inputs.to(DEVICE))
inputs = processor2(text=["test"], return_tensors="pt", padding=True)
result2 = model2.clip.get_text_features(**inputs.to(DEVICE))
result-result2

… I get a tensor filled with zeros, as expected.

However, if I now try to extract image features:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=[image], return_tensors="pt")
result = model.get_image_features(**inputs.to(DEVICE))
inputs2 = processor2(images=[image], return_tensors="pt")
result2 = model2.clip.get_image_features(**inputs2.to(DEVICE))
result-result2

… I get a tensor with non-zero, non-negligible values. In fact, inputs and inputs2 are already different before they reach the models. According to the documentation, CLIPSeg and CLIP use different image processors (ViTImageProcessor and CLIPImageProcessor respectively), and I don’t understand why.
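One way to see where inputs and inputs2 diverge might be to compare the two image processor configurations key by key. The sketch below uses a hypothetical helper, diff_configs, which is not part of transformers. It assumes both configurations are available as plain dictionaries, and it is demonstrated here on made-up toy values rather than the real configs.

```python
def diff_configs(cfg_a, cfg_b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    Keys missing from one side are reported with None for that side.
    """
    diffs = {}
    for key in sorted(set(cfg_a) | set(cfg_b)):
        a, b = cfg_a.get(key), cfg_b.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Toy dictionaries standing in for real processor configs (values are made up).
toy_a = {"do_resize": True, "size": 224, "image_mean": [0.5, 0.5, 0.5]}
toy_b = {"do_resize": True, "size": 352, "image_mean": [0.4, 0.4, 0.4],
         "do_center_crop": False}

toy_diff = diff_configs(toy_a, toy_b)
for key, (a, b) in toy_diff.items():
    print(f"{key}: {a!r} vs {b!r}")
```

With the objects from this post, the call would presumably be diff_configs(processor.image_processor.to_dict(), processor2.image_processor.to_dict()), assuming a transformers version in which both processors expose an image_processor attribute. The keys for resizing, cropping and normalization statistics are the ones worth looking at first.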

In the code above, if inputs2 is replaced with inputs in the call to model2.clip.get_image_features, then result2 and result are equal (i.e. the visual encoders are the same, as expected).
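Since the encoders agree on identical inputs, the gap has to come from preprocessing. As a minimal sketch of how normalization alone can shift the input, the snippet below normalizes the same raw pixel with two different sets of per-channel statistics. The numbers are the widely used CLIP and ImageNet statistics; that these are what the two checkpoints actually use is an assumption here and should be checked against each checkpoint's preprocessor config. Differences in resizing or cropping are not modeled.

```python
# Per-channel statistics; which checkpoint uses which set is an assumption.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean, std):
    """Rescale an 8-bit RGB pixel to [0, 1], then apply (x - mean) / std per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


pixel = (200, 120, 40)  # arbitrary example pixel
a = normalize_pixel(pixel, CLIP_MEAN, CLIP_STD)
b = normalize_pixel(pixel, IMAGENET_MEAN, IMAGENET_STD)

# Largest per-channel difference between the two normalized versions.
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(a)
print(b)
print(max_gap)
```

If the two processors do differ this way, every pixel reaching the shared visual encoder is shifted and rescaled differently, which would be enough to produce noticeably different image features even with identical weights.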

Any hint would be much appreciated. Many thanks!
