Potential bug in the RT-DETR v2 fine-tuning script

Hi,

I have been running this script in Google Colab.

First of all, thank you very much for this, super clear!

I have noticed a potential bug when running the trainer.evaluate command.

The original works fine; however, as soon as I try to run it for a single record, or with a dataset size that produces a single-record batch (say 9 records, since the eval batch size defaults to 8), it fails.

ValueError                                Traceback (most recent call last)
in <cell line: 0>()
----> 1 metrics2 = trainer.evaluate(eval_dataset=t_dataset, metric_key_prefix="eval")

4 frames
in collect_targets(self, targets, image_sizes)
     33                 # here we have "yolo" format (x_center, y_center, width, height) in relative coordinates 0..1
     34                 # and we need to convert it to "pascal" format (x_min, y_min, x_max, y_max) in absolute coordinates
---> 35                 height, width = size
     36                 boxes = torch.tensor(target["boxes"])
     37                 boxes = center_to_corners_format(boxes)

ValueError: not enough values to unpack (expected 2, got 1)

If you want to recreate the error:

 t_dataset = CPPE5Dataset(dataset["test"].select([1]), image_processor, transform=validation_transform)

metrics = trainer.evaluate(eval_dataset=t_dataset, metric_key_prefix="eval")

or

t_dataset = CPPE5Dataset(dataset["test"].select(list(range(2))), image_processor, transform=validation_transform)

metrics = trainer.evaluate(eval_dataset=t_dataset, metric_key_prefix="eval")

After digging a bit more into the error, I realised that the evaluation seems to keep only the first index for all the values in the label dictionary when the batch size is one. That is why image_size contains only tensor([480]) instead of tensor([480, 480]).
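
A minimal illustration of the failing unpack (480 being the image size in my run):

import torch

height, width = torch.tensor([480, 480])  # normal batch: unpacks fine
height, width = torch.tensor([480])       # single-record batch
# ValueError: not enough values to unpack (expected 2, got 1)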

Hope you guys can help! Thank you


We ran into a similar issue. For now, the “fix” is to ensure that a batch can never have a single element in it. In our case the batch size is 8, so we make sure that validation_df never has a remainder of 1 when divided by 8.

e.g.

    # If the validation set would leave a final batch of exactly one
    # sample, move one sample over from the training set to pad it out.
    validation_size = len(validation_df)
    remainder = validation_size % 8
    if remainder == 1:
        sample_to_move = train_df.iloc[0:1]
        train_df = train_df.iloc[1:]
        validation_df = pd.concat([validation_df, sample_to_move]).reset_index(drop=True)
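
The same trick generalizes to any batch size; a hypothetical helper (untested sketch, the function name is ours):

    import pandas as pd

    # Move one training sample into validation whenever the final eval
    # batch would otherwise contain exactly one element.
    def avoid_singleton_final_batch(train_df, validation_df, batch_size=8):
        if len(validation_df) % batch_size == 1:
            sample_to_move = train_df.iloc[0:1]
            train_df = train_df.iloc[1:].reset_index(drop=True)
            validation_df = pd.concat([validation_df, sample_to_move]).reset_index(drop=True)
        return train_df, validation_df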

@nicholasgcoles
Since you’ve run into this issue at the evaluation stage, it seems you managed to train successfully. I’m running into a problem here:
from transformers import AutoModelForObjectDetection

model = AutoModelForObjectDetection.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

RuntimeError: Error(s) in loading state_dict for RTDetrV2ForObjectDetection:
	size mismatch for model.denoising_class_embed.weight: copying a param with shape torch.Size([81, 256]) from checkpoint, the shape in current model is torch.Size([6, 256]).
	size mismatch for model.enc_score_head.weight: copying a param with shape torch.Size([80, 256]) from checkpoint, the shape in current model is torch.Size([5, 256]).
	size mismatch for model.enc_score_head.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([5]).
            ...

This seems to relate to the number of classes (COCO’s 80 vs CPPE-5’s 5). What’s odd is that ignore_mismatched_sizes=True doesn’t ignore the mismatch.
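
For reference, here is roughly what I’d expect ignore_mismatched_sizes=True to do. As an untested sketch (reusing checkpoint, id2label and label2id from the tutorial), one could filter out the shape-mismatched tensors manually:

from transformers import AutoConfig, AutoModelForObjectDetection

# Build a randomly initialized model with the 5 CPPE-5 classes.
config = AutoConfig.from_pretrained(checkpoint, id2label=id2label, label2id=label2id)
model = AutoModelForObjectDetection.from_config(config)

# Load the 80-class pretrained weights and keep only shape-compatible
# tensors, leaving the class-dependent heads randomly initialized.
pretrained = AutoModelForObjectDetection.from_pretrained(checkpoint)
model_state = model.state_dict()
filtered = {
    k: v
    for k, v in pretrained.state_dict().items()
    if k in model_state and v.shape == model_state[k].shape
}
model.load_state_dict(filtered, strict=False)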

Did either of you run into this?
cc @qubvel-hf as you authored this tutorial

Any help would be greatly appreciated

FYI I’m running the notebook on colab, setup below (though also ran into this on my local Ubuntu 20.04, x86 machine)
numpy version: 1.26.4
transformers version: 4.50.0.dev0
torch version: 2.5.1+cu124
Python 3.11.11
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz


Seems related:

From what I can tell, this also affects D-FINE, which uses the same image processor as RT-DETR and RT-DETRv2 (RTDetrImageProcessor).

Is there any solution for the script?


If we want to force a fix in the script itself, it would be something like this…?

    def collect_targets(self, targets, image_sizes):
        post_processed_targets = []
        for target_batch, image_size_batch in zip(targets, image_sizes):
            # Workaround: a single-record batch arrives with its batch
            # dimension squeezed away, so restore it before iterating.
            if image_size_batch.ndim == 1:
                image_size_batch = image_size_batch.unsqueeze(0)
            for target, size in zip(target_batch, image_size_batch):

                # here we have "yolo" format (x_center, y_center, width, height) in relative coordinates 0..1
                # and we need to convert it to "pascal" format (x_min, y_min, x_max, y_max) in absolute coordinates
                height, width = size
                boxes = torch.tensor(target["boxes"])
                boxes = center_to_corners_format(boxes)
                boxes = boxes * torch.tensor([[width, height, width, height]])
                # ... (rest of the method unchanged)

To fix it fundamentally, a patch upstream is probably necessary, but it seems the issue has already been raised there and remains unresolved.
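
In the meantime, a mitigation that avoids touching the metrics code at all might be to drop the incomplete final batch. This is an assumption on my side (recent Trainer versions forward dataloader_drop_last to the eval dataloader as well), and the trade-off is that the leftover samples are simply not evaluated:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="rtdetr-v2-finetune",  # hypothetical output dir
        per_device_eval_batch_size=8,
        dataloader_drop_last=True,  # drop the last batch when len(dataset) % 8 != 0
    )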