What bounding box format does Grounding DINO use?

Grounding DINO in the transformers library uses AnnotationFormat.COCO_DETECTION, but from what I can tell it formats its bounding boxes as [x_min, y_min, x_max, y_max], while COCO bounding boxes are [x, y, width, height].

references: COCO - Common Objects in Context, Grounding DINO


I think the boxes will be converted internally if you pass the annotations in COCO format.
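To make the difference between the two box layouts concrete, here is a minimal sketch of the conversion in plain Python. The helper names `coco_to_corners` and `corners_to_coco` are made up for illustration and are not part of transformers; this only shows the arithmetic, not what the library actually runs.

```python
def coco_to_corners(box):
    """Convert a COCO-style [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def corners_to_coco(box):
    """Convert [x_min, y_min, x_max, y_max] back to COCO-style [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


coco_box = [10.0, 20.0, 30.0, 40.0]       # top-left corner at (10, 20), 30 wide, 40 tall
corner_box = coco_to_corners(coco_box)    # [10.0, 20.0, 40.0, 60.0]
round_trip = corners_to_coco(corner_box)  # back to the original COCO box
```

If your own annotations are stored as corner coordinates, something like `corners_to_coco` would be needed before handing them over as COCO annotations.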

The image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': list[Dict]}, where each dictionary is a COCO object annotation.
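As a rough illustration of that structure, the sketch below builds one such dictionary from a list of boxes. The helper `make_coco_target` is hypothetical, and the choice of fields (`bbox`, `category_id`, `area`, `iscrowd`) follows the usual COCO object annotation layout; I am assuming these are the fields the image processor reads, so check against your installed version.

```python
def make_coco_target(image_id, boxes_xywh, category_ids):
    """Build {'image_id': int, 'annotations': list[dict]} from COCO-style boxes.

    boxes_xywh: list of [x, y, width, height] boxes in absolute pixels.
    category_ids: one integer class id per box.
    """
    annotations = []
    for box, category_id in zip(boxes_xywh, category_ids):
        x, y, w, h = box
        annotations.append(
            {
                "bbox": [x, y, w, h],        # COCO layout: top-left corner plus size
                "category_id": category_id,
                "area": w * h,               # box area in pixels
                "iscrowd": 0,                # 0 marks a single, non-crowd object
            }
        )
    return {"image_id": image_id, "annotations": annotations}


target = make_coco_target(0, [[10.0, 20.0, 30.0, 40.0]], [1])
```

The resulting `target` would then typically be passed together with the image, for example as `image_processor(images=image, annotations=target, return_tensors="pt")`.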