OWL-ViT batch image inference

Dear Hugging Face users,

I’m trying to implement batched image inference with OWL-ViT. At the moment, I’m working on a set of 11 images, with 72 labels and batch_size=2. I took the batching approach from here:


with the only difference being that I’m using the “google/owlvit-large-patch14” model instead of “google/owlvit-large-patch32”. The code works fine for the first two images, but on the third I get:

RuntimeError: shape '[4, 37, 768]' is invalid for input of size 115200

The error is raised at this call:


with torch.no_grad():
    outputs = model(**inputs)

I don’t understand what these shapes refer to. Do they relate to the image being processed, or to the underlying network? Did I make a mistake somewhere? Am I using too many labels? Thanks.
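For reference, here is a sketch of how I structure the batching loop (the image placeholders and the commented-out model call are illustrative, not my actual code; I assume the standard `transformers` OwlViTProcessor / OwlViTForObjectDetection API). The detail I’m trying to get right is that the processor should receive one copy of the label list per image in each batch, including the final, smaller batch (11 images with batch_size=2 leaves a last batch of 1 image):

```python
# Sketch of batched OWL-ViT inference. The model/processor calls are
# commented out so the snippet stays self-contained; in practice the
# images would be PIL images and the model call would run per batch.

def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

images = [f"image_{i}.jpg" for i in range(11)]  # placeholders for 11 images
labels = [f"label_{i}" for i in range(72)]      # 72 text queries

for batch in make_batches(images, batch_size=2):
    # One copy of the label list per image in the batch -- this must
    # also hold for the last batch, which contains only 1 image here.
    texts = [labels] * len(batch)
    assert len(texts) == len(batch)
    # inputs = processor(text=texts, images=batch, return_tensors="pt")
    # with torch.no_grad():
    #     outputs = model(**inputs)
```

Is this the expected way to pass the text queries, or does the per-image text list need to be handled differently when the last batch is smaller than batch_size?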

cc @adirik