OWL-ViT batch image inference

Dear Hugging Face users,

I’m trying to implement batched image inference with OWL-ViT. At the moment, I’m working on a set of 11 images, with 72 labels and batch_size=2. I took the batching approach from here:


with the only difference being that I’m using the “google/owlvit-large-patch14” model instead of “google/owlvit-large-patch32”. The code works fine for the first two images, but on the third I get:

RuntimeError: shape '[4, 37, 768]' is invalid for input of size 115200

The error is raised at this call:


with torch.no_grad():
    outputs = model(**inputs)

I don’t understand what these shapes refer to. Do they relate to the image being processed, or to the underlying network? Did I make a mistake somewhere? Am I using too many labels? Thanks.
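For reference, here is a sketch of how I structure the batching loop (the image placeholders and the commented-out model call are illustrative, not my actual code; I assume the standard `transformers` OwlViTProcessor / OwlViTForObjectDetection API). The detail I’m trying to get right is that the processor should receive one copy of the label list per image in each batch, including the final, smaller batch (11 images with batch_size=2 leaves a last batch of 1 image):

```python
# Sketch of batched OWL-ViT inference. The model/processor calls are
# commented out so the snippet stays self-contained; in practice the
# images would be PIL images and the model call would run per batch.

def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

images = [f"image_{i}.jpg" for i in range(11)]  # placeholders for 11 images
labels = [f"label_{i}" for i in range(72)]      # 72 text queries

for batch in make_batches(images, batch_size=2):
    # One copy of the label list per image in the batch -- this must
    # also hold for the last batch, which contains only 1 image here.
    texts = [labels] * len(batch)
    assert len(texts) == len(batch)
    # inputs = processor(text=texts, images=batch, return_tensors="pt")
    # with torch.no_grad():
    #     outputs = model(**inputs)
```

Is this the expected way to pass the text queries, or does the per-image text list need to be handled differently when the last batch is smaller than batch_size?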

cc @adirik