CLIP model incorporated in CLIPSeg

Hi. I understand that CLIPSeg incorporates a frozen openai/clip-vit-base-patch16 model. However, I don't get the same results when I extract image features with the CLIP model embedded in CIDAS/clipseg-rd64-refined as when I use openai/clip-vit-base-patch16 directly.

More precisely, if I extract text features with both models:

from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

model2 = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to(DEVICE)
processor2 = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

inputs = processor(text=["test"], return_tensors="pt", padding=True).to(DEVICE)
result = model.get_text_features(**inputs)

inputs2 = processor2(text=["test"], return_tensors="pt", padding=True).to(DEVICE)
result2 = model2.clip.get_text_features(**inputs2)

print(result - result2)

… the difference result - result2 is a tensor filled with zeros, as expected.

However if I now try to extract image features,

url = ""
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=[image], return_tensors="pt").to(DEVICE)
result = model.get_image_features(**inputs)

inputs2 = processor2(images=[image], return_tensors="pt").to(DEVICE)
result2 = model2.clip.get_image_features(**inputs2)

… the difference result - result2 is a tensor with non-zero (and non-negligible) values. In fact, inputs and inputs2 themselves are different. According to the documentation, the image processor for CLIPSeg (ViTImageProcessor) and the one for CLIP (CLIPImageProcessor) are not the same, and I don't understand why.
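One way to see where the two sets of inputs diverge is to print the preprocessing settings side by side (a diagnostic sketch, assuming a recent transformers version where processors expose an image_processor attribute):

```python
from transformers import CLIPProcessor, CLIPSegProcessor

# Load only the processors (no model weights needed for this check).
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
clipseg_proc = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

for name, ip in [("CLIP", clip_proc.image_processor),
                 ("CLIPSeg", clipseg_proc.image_processor)]:
    # size, image_mean and image_std drive the resize and normalization
    # steps, so any mismatch here yields different pixel_values tensors.
    print(name, type(ip).__name__, ip.size, ip.image_mean, ip.image_std)
```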

In the code above, if inputs2 is replaced with inputs in the last line, then result2 and result are equal (i.e. the visual encoders are the same, as expected).

Any hint would be much appreciated. Many thanks!