Hi,
I am trying to use the GIT multimodal model (microsoft/git-base-textvqa) for Visual Question Answering. The shape of the logits returned by the forward function is not (batch_size, sequence_length, config.vocab_size) as stated in the documentation (GIT). Is this a bug? Kindly help.
Code to reproduce the issue is given below:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from huggingface_hub import hf_hub_download
from PIL import Image
processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")
file_path = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset")
image = Image.open(file_path).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
question = "what does the front of the bus say at the top?"
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = [processor.tokenizer.cls_token_id] + input_ids  # manually prepend the CLS token, as in the GIT docs example
input_ids = torch.tensor(input_ids).unsqueeze(0)
print(input_ids.shape) # torch.Size([1, 13])
output = model(pixel_values=pixel_values, input_ids=input_ids)
logits = output.logits
print(logits.shape) # torch.Size([1, 914, 30522])
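
For reference, here is a minimal check of the mismatch, assuming the documented sequence_length refers to the length of input_ids (13 here) rather than the combined image+text sequence:

# Hypothetical check: compare the shape I expected from the docs with what is returned
expected_shape = (input_ids.shape[0], input_ids.shape[1], model.config.vocab_size)
print(expected_shape)       # (1, 13, 30522) -- my reading of the documentation
print(tuple(logits.shape))  # (1, 914, 30522) -- what the model actually returns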