Phi3 vision number of tokens

I am looking at using the Phi-3-vision models to describe an image. However, I couldn’t help but notice that a single image takes quite a large number of tokens (~2000). Is this correct, or a potential bug? I have included a code snippet below so that you can check my assumptions.

From my understanding of VLMs, they simply take an image and use CLIP or a similar encoder to project it to one (or a few) tokens, so that the image effectively becomes a “language token”.
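
For reference, this is roughly the mental model I have in mind, as a minimal sketch only (the CLIP checkpoint, the 4-token count and the 3072-dim projection are arbitrary assumptions for illustration, not how Phi-3-vision is actually wired):

# Illustrative sketch of the mental model above, NOT how Phi-3-vision actually
# works: encode one image with CLIP and project the pooled embedding into a
# small, fixed number of "soft tokens" in the language model's embedding space.
# The checkpoint, the 4-token count and the 3072 hidden size are all assumptions.
import torch
from transformers import CLIPVisionModel

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

pixel_values = torch.randn(1, 3, 336, 336)               # stand-in for one preprocessed image
with torch.no_grad():
    pooled = clip(pixel_values).pooler_output            # (1, 1024) image embedding

num_soft_tokens, lm_hidden = 4, 3072                      # assumed, purely illustrative
projector = torch.nn.Linear(pooled.shape[-1], num_soft_tokens * lm_hidden)
soft_tokens = projector(pooled).view(1, num_soft_tokens, lm_hidden)
print(soft_tokens.shape)                                  # torch.Size([1, 4, 3072])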

Side questions

In case it helps me understand Phi-3:

  1. Where is the 17 coming from in the pixel_values shape below?
  2. Why is image_sizes shaped (1, 2) and not (1, 1), given that I have only referenced one image?
import requests
from PIL import Image

from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
url = "https://sm.ign.com/t/ign_ap/review/d/deadpool-r/deadpool-review_2s7s.1200.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Render the chat template to a prompt string, then preprocess the prompt and image together
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

# Inspect the shapes of everything the processor produced
print({k: v.shape for k, v in inputs.items()})
# {'input_ids': torch.Size([1, 2371]),
# 'attention_mask': torch.Size([1, 2371]),
# 'pixel_values': torch.Size([1, 17, 3, 336, 336]),
# 'image_sizes': torch.Size([1, 2])}
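
Continuing from the snippet above, here is my rough reading of those two shapes. The interpretation is a guess from poking at the processor output, not something I have verified in the remote processing code:

# My guess: the crop dimension of pixel_values is always num_crops + 1, i.e. up to
# 16 local 336x336 tiles (the default maximum) plus one globally downscaled view,
# with unused tile slots padded -- hence the 17. (Assumption, not verified.)
assumed_max_local_crops = 16
assumed_global_views = 1
assert inputs["pixel_values"].shape[1] == assumed_max_local_crops + assumed_global_views  # 17

# My guess: image_sizes has one row per image holding a (height, width) pair after
# resizing/padding, so (1, 2) means "one image, two numbers", not "two images".
print(inputs["image_sizes"])  # I expect two values that are multiples of 336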

From the Phi-3 technical report: “To accommodate high-resolution images and various aspect ratios, a dynamic cropping strategy is utilized to split the input image into a 2d array of blocks, where the tokens of the blocks are concatenated to represent the whole image.”

So Phi-3-vision seems to use a technique similar to LLaVA-1.6, splitting an image into multiple sub-images and then concatenating their embeddings.
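
To see whether that kind of tiling roughly explains the ~2000 tokens, here is a back-of-the-envelope sketch. The 336x336 tile size matches the pixel_values shape above, but the 144 tokens per tile, the single global view, and the example grid are assumptions for illustration, not values taken from the Phi-3-vision code:

# Back-of-the-envelope estimate for dynamic-cropping token counts.
# All constants are assumptions for illustration.
import math

TILE = 336             # tile size, matching the 336x336 crops in pixel_values
TOKENS_PER_TILE = 144  # assumed tokens per tile after the vision projector

def estimated_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens for a tiled image: one token block per 336x336 tile
    in the 2D grid, plus one block for a downscaled global view of the image."""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return (rows * cols + 1) * TOKENS_PER_TILE

# Example: a 16:9 image scaled up to a 5x3 grid of tiles (~1680x1008 pixels)
print(estimated_image_tokens(1680, 1008))  # (5*3 + 1) * 144 = 2304

That lands in the same ballpark as the 2371 input_ids above once the text prompt and any separator tokens are included, so the ~2000 figure looks expected rather than a bug.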