How to perform batch inference on the GroundingDino model

frames = [.....List of PIL.Image....]
inputs = processor(images=frames, text=text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
print(outputs)

Error:
RuntimeError: The size of tensor a (3) must match the size of tensor b (6) at non-singleton dimension 2


I have the same problem… any updates?
Thanks!

Hi @royve, thanks for the question! It would be nice to have a minimal reproducing example and your environment 🙂

I was able to run batched inference with the following environment and code:

- `transformers` version: 4.44.0.dev0
- Platform: Linux-6.5.0-1020-aws-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.4.0+cu118 (True)
- GPU type: NVIDIA A10G

import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
images = [image, image]
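# One text prompt per image in the batch; categories within a prompt are separated by periods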
texts = [
    "a cat. a remote control.",
    "a cat. a remote control. a sofa.",
]

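# padding=True pads the tokenized prompts to the same length, so texts with different numbers of tokens can be batched together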
inputs = processor(images=images, text=texts, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

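# Post-process the raw outputs into thresholded detections, rescaled to each image's (height, width)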
w, h = image.size
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[(h, w), (h, w)],
)
print(results)
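
For completeness, here is a minimal sketch of iterating over `results`; it assumes each entry is a per-image dict with `scores`, `labels`, and `boxes` keys (boxes in xyxy pixel coordinates, since `target_sizes` was passed):

# Minimal sketch (assumes "scores"/"labels"/"boxes" keys in each per-image result dict)
for image_idx, result in enumerate(results):
    print(f"Image {image_idx}:")
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        box = [round(coord, 2) for coord in box.tolist()]
        print(f"  {label}: score={score.item():.3f}, box={box}")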