Object detection resolution fine-tuning

Most object detection models are trained at 640x640 resolution. I want to fine-tune a model to detect my classes of interest, but the camera I will be using is a 16:9 camera. Is it (1) possible and is it a good idea (2) to fine-tune my model with a 16:9 input resolution (such as 540x960)?

From what I can tell, the Trainer's TrainingArguments do not have arguments for controlling the image input resolution. Does it train with whatever resolution the input images have (which would likely be set in the pre-processing step)?

Note: For inference you can use the size argument in processor initialization

Note: I am interested in the RT-DETRv2 architecture, but feel free to answer for a similar architecture like DETR if you are familiar with it.
Note: From what I can tell, you can change the input image size during training with YOLO/Ultralytics, but I'm not sure if the same concept applies to DETR object detectors.
Thanks!

Is it (1) possible

As long as the processor works properly, there shouldn't be any major problems. Some models, such as CLIP, seem to have hard-coded resolutions, but otherwise something like this should be fine:

from transformers import DetrImageProcessor

image_processor = DetrImageProcessor.from_pretrained(
    "facebook/detr-resnet-50",
    do_resize=True,
    size={"height": 540, "width": 960},   # ← your 16:9 resolution
    default_to_square=False,
    do_pad=True,
    pad_size={"height": 540, "width": 960},
)
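When picking the numbers to pass as size, it can help to check them against the backbone's downsampling factor. The sketch below rounds each side up to a multiple of 32. The value 32 is an assumption based on the total stride commonly used by ResNet-style backbones, not something the processor requires, and snap_to_stride is a hypothetical helper name, not part of transformers.

```python
def snap_to_stride(height: int, width: int, stride: int = 32) -> dict:
    """Round each side up to the nearest multiple of `stride`.

    `stride=32` is an assumed backbone downsampling factor; adjust it
    to match the model you are actually fine-tuning.
    """
    def round_up(value: int) -> int:
        return ((value + stride - 1) // stride) * stride

    # Same {"height": ..., "width": ...} layout as the processor's `size` argument.
    return {"height": round_up(height), "width": round_up(width)}


size = snap_to_stride(540, 960)
print(size)  # {'height': 544, 'width': 960}
```

The resulting dictionary could be passed as size (and pad_size) in place of the hard-coded 540x960 if you want both sides to be stride-aligned. 544x960 is no longer exactly 16:9, but the difference is under 1%.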

and is it a good idea (2)

There does not seem to be much of a negative impact on accuracy. However, since the pretrained weights were learned on square inputs, you may need to fine-tune thoroughly and validate the results at your target resolution yourself.
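One geometric argument for the 16:9 input can be sketched with plain arithmetic. This assumes the processor resizes straight to the requested height and width without preserving the aspect ratio, and it assumes a 1920x1080 camera frame, which is a made-up example rather than something stated in the question. aspect_distortion is a hypothetical helper name.

```python
def aspect_distortion(src_w: int, src_h: int, dst_w: int, dst_h: int) -> float:
    """Return vertical scale divided by horizontal scale for a direct resize.

    A value of 1.0 means objects keep their shape; anything else means
    they are stretched or squashed along one axis.
    """
    scale_x = dst_w / src_w
    scale_y = dst_h / src_h
    return scale_y / scale_x


# Assumed camera frame: 1920x1080 (16:9).
print(aspect_distortion(1920, 1080, 960, 540))  # 16:9 target, shapes preserved
print(aspect_distortion(1920, 1080, 640, 640))  # square target, shapes stretched vertically
```

With the 16:9 target the ratio is 1.0, so objects keep their proportions. With the square target it is 16/9, meaning every object looks about 1.78 times taller relative to its width than it does in the camera frame. Whether that matters for accuracy still has to be checked on your own data.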