Phi3 vision number of tokens

I am looking at using the Phi-3-vision models to describe an image. However, I couldn’t help but notice that a single image takes quite a large number of tokens (~2000). Is this correct, or a potential bug? I have included a code snippet below so that you can check my assumptions.

From my understanding of VLMs, they simply take an image and use CLIP or a similar encoder to project it to one (or a few) tokens, so that the image effectively becomes a “language token”.
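
For reference, this is roughly the mental model I have in mind, as a minimal sketch only (the CLIP checkpoint, the 4-token count and the 3072-dim projection are arbitrary assumptions for illustration, not how Phi-3-vision is actually wired):

# Illustrative sketch of the mental model above, NOT how Phi-3-vision actually
# works: encode one image with CLIP and project the pooled embedding into a
# small, fixed number of "soft tokens" in the language model's embedding space.
# The checkpoint, the 4-token count and the 3072 hidden size are all assumptions.
import torch
from transformers import CLIPVisionModel

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

pixel_values = torch.randn(1, 3, 336, 336)               # stand-in for one preprocessed image
with torch.no_grad():
    pooled = clip(pixel_values).pooler_output            # (1, 1024) image embedding

num_soft_tokens, lm_hidden = 4, 3072                      # assumed, purely illustrative
projector = torch.nn.Linear(pooled.shape[-1], num_soft_tokens * lm_hidden)
soft_tokens = projector(pooled).view(1, num_soft_tokens, lm_hidden)
print(soft_tokens.shape)                                  # torch.Size([1, 4, 3072])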

Side questions

In case it helps me understand Phi-3:

  1. Where is the 17 coming from in the pixel_values shape below?
  2. Why is image_sizes shaped (1, 2) and not (1, 1), given that I have only referenced one image?
import requests
from PIL import Image

from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
url = "https://sm.ign.com/t/ign_ap/review/d/deadpool-r/deadpool-review_2s7s.1200.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Render the chat template to a prompt string, then preprocess the prompt and image together
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

# Inspect the shapes of everything the processor produced
print({k: v.shape for k, v in inputs.items()})
# {'input_ids': torch.Size([1, 2371]),
# 'attention_mask': torch.Size([1, 2371]),
# 'pixel_values': torch.Size([1, 17, 3, 336, 336]),
# 'image_sizes': torch.Size([1, 2])}
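
Continuing from the snippet above, here is my rough reading of those two shapes. The interpretation is a guess from poking at the processor output, not something I have verified in the remote processing code:

# My guess: the crop dimension of pixel_values is always num_crops + 1, i.e. up to
# 16 local 336x336 tiles (the default maximum) plus one globally downscaled view,
# with unused tile slots padded -- hence the 17. (Assumption, not verified.)
assumed_max_local_crops = 16
assumed_global_views = 1
assert inputs["pixel_values"].shape[1] == assumed_max_local_crops + assumed_global_views  # 17

# My guess: image_sizes has one row per image holding a (height, width) pair after
# resizing/padding, so (1, 2) means "one image, two numbers", not "two images".
print(inputs["image_sizes"])  # I expect two values that are multiples of 336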

From the Phi-3 technical report: “To accommodate high-resolution images and various aspect ratios, a dynamic cropping strategy is utilized to split the input image into a 2d array of blocks, where the tokens of the blocks are concatenated to represent the whole image.”

So Phi-3-vision seems to use a technique similar to LLaVA-1.6, splitting an image into multiple sub-images and then concatenating their embeddings.
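
To see whether that kind of tiling roughly explains the ~2000 tokens, here is a back-of-the-envelope sketch. The 336x336 tile size matches the pixel_values shape above, but the 144 tokens per tile, the single global view, and the example grid are assumptions for illustration, not values taken from the Phi-3-vision code:

# Back-of-the-envelope estimate for dynamic-cropping token counts.
# All constants are assumptions for illustration.
import math

TILE = 336             # tile size, matching the 336x336 crops in pixel_values
TOKENS_PER_TILE = 144  # assumed tokens per tile after the vision projector

def estimated_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens for a tiled image: one token block per 336x336 tile
    in the 2D grid, plus one block for a downscaled global view of the image."""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return (rows * cols + 1) * TOKENS_PER_TILE

# Example: a 16:9 image scaled up to a 5x3 grid of tiles (~1680x1008 pixels)
print(estimated_image_tokens(1680, 1008))  # (5*3 + 1) * 144 = 2304

That lands in the same ballpark as the 2371 input_ids above once the text prompt and any separator tokens are included, so the ~2000 figure looks expected rather than a bug.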