CLIP model incorporated in CLIPSeg

Hi. I understand that CLIPSeg incorporates a frozen openai/clip-vit-base-patch16 model. However, I don’t get the same results when I extract image features with the CLIP model inside CIDAS/clipseg-rd64-refined as when I use openai/clip-vit-base-patch16 directly.

More precisely, if I extract text features with both models:

from transformers import CLIPSegForImageSegmentation, CLIPSegProcessor
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import requests

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

model2 = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to(DEVICE)
processor2 = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")

inputs = processor(text=["test"], return_tensors="pt", padding=True)
result = model.get_text_features(**inputs.to(DEVICE))
inputs = processor2(text=["test"], return_tensors="pt", padding=True)
result2 = model2.clip.get_text_features(**inputs.to(DEVICE))
result-result2

… I get a tensor filled with zeros, as expected.

However, if I now extract image features in the same way:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=[image], return_tensors="pt")
result = model.get_image_features(**inputs.to(DEVICE))
inputs2 = processor2(images=[image], return_tensors="pt")
result2 = model2.clip.get_image_features(**inputs2.to(DEVICE))
result-result2

… I get a tensor with non-zero (and not negligible) values. In fact, inputs and inputs2 are themselves different, so the two results already diverge after preprocessing. According to the documentation, CLIPSeg uses ViTImageProcessor while CLIP uses CLIPImageProcessor, and I don’t understand why the two image processors are not the same.

In the code above, if inputs2 is replaced with inputs in the model2.clip.get_image_features call, then result2 and result are equal (i.e. the visual encoders are the same, as expected).
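Since the encoders match, the difference has to come from the two image processors, so diffing their settings should narrow it down. Here is a minimal sketch of that check. The helper name diff_settings is made up for illustration and the toy dicts hold made-up values, not the real configurations. The commented call at the end assumes that both processors expose an image_processor attribute whose to_dict() returns the preprocessing settings as a plain dict.

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    missing = "<missing>"  # placeholder for a key present on one side only
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k, missing), b.get(k, missing))
        for k in keys
        if a.get(k, missing) != b.get(k, missing)
    }


# Toy dicts with made-up values, only to show the shape of the output.
clip_like = {"do_resize": True, "size": 224, "image_mean": [0.5, 0.5, 0.5]}
clipseg_like = {"do_resize": True, "size": 352, "image_mean": [0.4, 0.4, 0.4]}

for key, (left, right) in diff_settings(clip_like, clipseg_like).items():
    print(f"{key}: {left!r} vs {right!r}")

# With the objects from the snippets above, the call would be (assumption:
# a transformers version where processors have an image_processor attribute):
# diff_settings(processor.image_processor.to_dict(),
#               processor2.image_processor.to_dict())
```

Any key reported here (for example resize size, resampling filter, or normalization mean/std) would explain why inputs and inputs2 hold different pixel values for the same image.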

Any hint would be much appreciated. Many thanks!