Incorrect logits shape for GIT model

Hi,
I am trying to use the GIT multimodal model (microsoft/git-base-textvqa) for Visual Question Answering. The shape of the logits returned by the forward pass is not (batch_size, sequence_length, config.vocab_size) as described in the documentation (GIT): the sequence dimension is much larger than the number of input text tokens. Is this a bug? Kindly help.

Code to reproduce the issue is given below:

import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from huggingface_hub import hf_hub_download
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")

file_path = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset")
image = Image.open(file_path).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values

question = "what does the front of the bus say at the top?"

input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = [processor.tokenizer.cls_token_id] + input_ids
input_ids = torch.tensor(input_ids).unsqueeze(0)
print(input_ids.shape) # torch.Size([1, 13])

output = model(pixel_values=pixel_values, input_ids=input_ids)
logits = output.logits
print(logits.shape) # torch.Size([1, 914, 30522])
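
For context, the mismatch comes from the image: GIT projects the vision encoder's patch embeddings and concatenates them with the text token embeddings, so the logits span the image positions as well as the 13 text positions (914 in total here). The snippet below continues from the code above; it is a minimal sketch that assumes the text tokens occupy the last input_ids.shape[1] positions of the concatenated sequence, and it also shows the generate-based usage from the model documentation for actually answering the question.

# Sketch (not from the original post): assuming the projected image patch tokens
# are prepended to the text tokens, the text-token logits are the trailing
# positions of the sequence dimension.
text_logits = logits[:, -input_ids.shape[1]:, :]
print(text_logits.shape)  # expected: torch.Size([1, 13, 30522])

# Documented way to answer the question: generate and decode.
generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))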

Answered here: Incorrect logits shape for GIT model (microsoft/git-base-textvqa) · Issue #33107 · huggingface/transformers · GitHub
