Possible fix for trainer evaluation with object detection

I have bumped into an issue with the Trainer class for object detection that others have already found (there is at least one issue about it in the repository): the evaluation loop doesn’t work well with this type of model.

To summarize, the labels input for the DETR family of object detection models is a list of dictionaries. For example:

[
    {
        'size': tensor([ 771, 1333]),
        'image_id': tensor([1592]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.2268, 0.6586, 0.1567, 0.1480]]),
        'area': tensor([23827.1230]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([561, 970])
    },
    {
        'size': tensor([1333,  763]),
        'image_id': tensor([44]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.4216, 0.4584, 0.3794, 0.1979]]),
        'area': tensor([76371.3984]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([926, 530])
    },
]

Each dictionary represents one image, with all its bounding boxes. Cool.
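
For context, labels in this format typically come from a collate function along these lines. This is just a sketch of the usual DETR fine-tuning setup, and I’m assuming the images were already processed to a common size, so a plain stack works instead of the processor’s padding:

    import torch

    def collate_fn(batch):
        # Stack the images into one tensor; keep the labels as a list of
        # per-image annotation dictionaries, as shown above.
        pixel_values = torch.stack([item["pixel_values"] for item in batch])
        labels = [item["labels"] for item in batch]
        return {"pixel_values": pixel_values, "labels": labels}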

The problem arises when the trainer runs the evaluation loop and has to accumulate the results across batches. It concatenates the labels of every evaluation batch using the internal nested_concat function.

Given two lists of stuff, this function creates a new list, in which the i-th element is the concatenation of the i-th elements of the two inputs. In the case above, where we have lists of dictionaries, it will concatenate dictionaries… which means concatenating the values corresponding to each key.
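
Here is a minimal stand-in for that behavior, heavily simplified from the real nested_concat (which also pads tensors to a common shape); it reproduces the problem on two single-image batches:

    import torch
    from collections.abc import Mapping

    def nested_concat(tensors, new_tensors):
        if isinstance(tensors, (list, tuple)):
            # Pairs up the i-th elements of the two batches; any
            # unmatched tail is silently dropped by zip.
            return type(tensors)(nested_concat(t, n) for t, n in zip(tensors, new_tensors))
        if isinstance(tensors, Mapping):
            # Merges two images' annotation dicts key by key.
            return {k: nested_concat(v, new_tensors[k]) for k, v in tensors.items()}
        return torch.cat((tensors, new_tensors), dim=0)

    batch1 = [{"image_id": torch.tensor([1592]), "boxes": torch.rand(1, 4)}]
    batch2 = [{"image_id": torch.tensor([44]), "boxes": torch.rand(2, 4)}]
    print(nested_concat(batch1, batch2)[0]["image_id"])  # tensor([1592,   44])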

The result? Something that might look like this:

[
    {
        'size': tensor([1093,  800,  800,  941]),
        'image_id': tensor([4823, 4814]),
        'class_labels': tensor([], dtype=torch.int64),
        'boxes': tensor([], size=(0, 4)),
        'area': tensor([]),
        'iscrowd': tensor([], dtype=torch.int64),
        'orig_size': tensor([1024,  749,  673,  792])
    },
    {
        'size': tensor([ 595, 1332, 1058,  800]),
        'image_id': tensor([  12, 4824]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.6425, 0.7350, 0.2068, 0.3502]]),
        'area': tensor([57399.0664]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([434, 972, 967, 731])
    }
]

Problems:

  • Now each item in the list is a dictionary representing two images. Instead of a size key holding a single image’s dimensions, we get the values of two images fused into one tensor, which is inconvenient at best.
  • Given two lists of different lengths, the result is truncated to the shorter one, because zip stops as soon as either input runs out, silently dropping the rest.
  • Images may have a variable number of boxes. Once the boxes of several images are concatenated, there is no way of knowing how many belong to which image.

This makes it impossible to define a proper compute_metrics function.
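
To make that concrete, here is the shape of compute_metrics one would want to write (a hypothetical sketch, with the actual metric computation elided):

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        # What we need: one label dict per image, so predictions[i]
        # can be matched against labels[i].
        for pred, target in zip(predictions, labels):
            boxes = target["boxes"]           # should be ONE image's boxes
            classes = target["class_labels"]  # and its class labels
            ...  # match pred against (boxes, classes), accumulate e.g. mAP
        # After nested_concat, each target instead mixes several images'
        # boxes, with no record of which box belongs to which image.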

I would gladly submit a pull request, except that I don’t even know what an appropriate fix would look like that doesn’t break something else. I suppose the ideal solution would include some way for a model to signal the type of labels it works with.

I do have a simple suggestion, based on the assumption that lists of dictionaries should be concatenated as lists, NOT have their dictionaries merged with each other pairwise.

Essentially, replace this in nested_concat:

    if isinstance(tensors, (list, tuple)):
        return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))

with this (where Mapping is collections.abc.Mapping):

    if isinstance(tensors, (list, tuple)):
        if len(tensors) and isinstance(tensors, list) and isinstance(tensors[0], Mapping):
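            # A list of dicts (e.g. object detection labels): extend the
            # list instead of merging its dicts element by element.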
            return tensors + new_tensors
        else:
            return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
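
With this change, accumulating batches simply extends the list, so each element keeps describing exactly one image. A quick standalone sanity check of the intended behavior (plain list concatenation, which is what the patched branch returns):

    import torch

    batch1_labels = [{"image_id": torch.tensor([1592]), "boxes": torch.rand(1, 4)}]
    batch2_labels = [{"image_id": torch.tensor([44]), "boxes": torch.rand(2, 4)}]

    accumulated = batch1_labels + batch2_labels  # what the patched branch returns
    assert len(accumulated) == 2                 # one dict per image, nothing merged
    assert accumulated[0]["boxes"].shape == (1, 4)
    assert accumulated[1]["boxes"].shape == (2, 4)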

Please let me know if that looks good :slight_smile: