`target_sizes` and `output.logits` do not align in `image_processor.post_process_object_detection`

I am trying to fine-tune RT-DETR on two GPUs following this script. The batch size is 8 per GPU, on 2 GPUs (16 in total).

When execution reaches the `compute_metric` method, I get a mismatch between `output.logits` and `target_sizes`: the batch dimension of `output.logits` is 8, while that of `target_sizes` is 16. This is the stack trace:

 File "/home/jb/.cache/pypoetry/virtualenvs/ml-Mf12zaqr-py3.11/lib/python3.11/site-packages/transformers/models/rt_detr/image_processing_rt_detr.py", line 1062, in post_process_object_detection
    raise ValueError(
ValueError: Make sure that you pass in as many target sizes as the batch dimension of the logits
 50%|██████████▌          | 77/154 [00:39<00:39,  1.95it/s]

I suspect that the `target_sizes` tensor is being gathered across all devices (2 GPUs × 8 = 16) when maybe it shouldn't be. I would appreciate any help!
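
In case it helps anyone reproduce this, below is a minimal sketch of a check that could be run just before the `post_process_object_detection` call to confirm the mismatch. `describe_batch_mismatch` and its `num_devices` argument are hypothetical names for illustration, not part of transformers, and the lists at the bottom are stand-ins with the batch sizes reported above rather than real model outputs.

```python
def describe_batch_mismatch(logits, target_sizes, num_devices=1):
    """Compare the batch dimensions of logits and target_sizes.

    Hypothetical debugging helper, not part of transformers. It only
    relies on len(), so it accepts torch tensors, numpy arrays or lists.
    """
    n_logits = len(logits)
    n_targets = len(target_sizes)
    if n_logits == n_targets:
        return f"aligned: {n_logits} items"
    msg = f"mismatch: logits batch={n_logits}, target_sizes batch={n_targets}"
    # A ratio equal to the device count hints at a cross-device gather.
    if n_targets == n_logits * num_devices:
        msg += f" (target_sizes is {num_devices}x larger, matching the device count)"
    return msg


# Stand-ins with the batch sizes from this report: 8 logits rows, 16 target sizes.
fake_logits = [[0.0]] * 8
fake_target_sizes = [(480, 640)] * 16
print(describe_batch_mismatch(fake_logits, fake_target_sizes, num_devices=2))
```

If the ratio between the two batch dimensions equals the number of GPUs, that would be consistent with one of the two inputs being gathered across devices and the other not, though it does not prove where the gather happens.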