Grounding DINO in the transformers library uses AnnotationFormat.COCO_DETECTION, but from what I can tell it formats its bounding boxes as [x_min, y_min, x_max, y_max], while COCO bounding boxes are [x, y, width, height], where (x, y) is the top-left corner.
I think the image processor will convert the boxes internally if you pass them in the COCO [x, y, width, height] format.
The image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': list[Dict]}, where each dictionary is a COCO object annotation.
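If your boxes are currently in [x_min, y_min, x_max, y_max], a minimal sketch of converting them and building that dictionary could look like the following. The helper names (xyxy_to_coco_xywh, build_coco_target) are made up for illustration, and the per-object keys beyond "bbox" and "category_id" ("area", "iscrowd") are assumed from the standard COCO object annotation layout, so check them against what your version of the image processor actually reads.

```python
def xyxy_to_coco_xywh(box):
    """Convert [x_min, y_min, x_max, y_max] to COCO [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def build_coco_target(image_id, boxes_xyxy, category_ids):
    """Build {'image_id': int, 'annotations': list[dict]} from corner-format boxes.

    Hypothetical helper: the field set follows the usual COCO object
    annotation layout and is an assumption, not taken from the library source.
    """
    annotations = []
    for box, category_id in zip(boxes_xyxy, category_ids):
        x, y, w, h = xyxy_to_coco_xywh(box)
        annotations.append(
            {
                "image_id": image_id,
                "category_id": category_id,
                "bbox": [x, y, w, h],  # COCO format: top-left corner plus size
                "area": w * h,
                "iscrowd": 0,
            }
        )
    return {"image_id": image_id, "annotations": annotations}


# One box given as corners, converted to a COCO-style target.
target = build_coco_target(image_id=7, boxes_xyxy=[[10, 20, 110, 70]], category_ids=[3])
print(target)
```

The resulting target is what you would then pass as the annotations for the matching image; whether any further conversion happens inside the processor is something I have not verified.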