I have run into an issue with the `Trainer` class for object detection, which others have already found (it is present as at least one issue in the repository): the evaluation loop doesn't work well with this type of model.

To summarize, the `labels` input for the DETR family of object detection models is a list of dictionaries. For example:

```
[
    {
        'size': tensor([ 771, 1333]),
        'image_id': tensor([1592]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.2268, 0.6586, 0.1567, 0.1480]]),
        'area': tensor([23827.1230]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([561, 970])
    },
    {
        'size': tensor([1333, 763]),
        'image_id': tensor([44]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.4216, 0.4584, 0.3794, 0.1979]]),
        'area': tensor([76371.3984]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([926, 530])
    },
]
```

Each dictionary represents one image, with all its bounding boxes. Cool.
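For context, this is roughly how such labels come out of a data collator. A minimal sketch (the function name and dataset item format are assumptions for illustration): the pixel values can be stacked into one tensor, but the labels must stay a list of per-image dicts, because each image can have a different number of boxes.

```python
import torch

def collate_fn(batch):
    # batch: list of (pixel_values, target) pairs from the dataset
    pixel_values = torch.stack([item[0] for item in batch])
    # labels stays a *list* of dicts -- one per image -- because each image
    # can have a different number of boxes
    labels = [item[1] for item in batch]
    return {"pixel_values": pixel_values, "labels": labels}

batch = [
    (torch.zeros(3, 4, 4), {"class_labels": torch.tensor([0]),
                            "boxes": torch.tensor([[0.2, 0.6, 0.1, 0.1]])}),
    (torch.zeros(3, 4, 4), {"class_labels": torch.tensor([0, 1]),
                            "boxes": torch.rand(2, 4)}),
]
out = collate_fn(batch)
print(len(out["labels"]))  # 2 -- one dict per image
```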

The problem arises when the trainer runs the evaluation loop and must accumulate the results. It concatenates the labels of every evaluation batch using the internal `nested_concat`.

Given two lists, this function creates a new list in which the `i`-th element is the concatenation of the `i`-th elements of the two inputs. In the case above, where we have lists of dictionaries, it will concatenate dictionaries, which means concatenating the values corresponding to each key.

The result? Something that might look like this:

```
[
    {
        'size': tensor([1093, 800, 800, 941]),
        'image_id': tensor([4823, 4814]),
        'class_labels': tensor([], dtype=torch.int64),
        'boxes': tensor([], size=(0, 4)),
        'area': tensor([]),
        'iscrowd': tensor([], dtype=torch.int64),
        'orig_size': tensor([1024, 749, 673, 792])
    },
    {
        'size': tensor([ 595, 1332, 1058, 800]),
        'image_id': tensor([ 12, 4824]),
        'class_labels': tensor([0]),
        'boxes': tensor([[0.6425, 0.7350, 0.2068, 0.3502]]),
        'area': tensor([57399.0664]),
        'iscrowd': tensor([0]),
        'orig_size': tensor([434, 972, 967, 731])
    },
]
```

Problems:

- Each item in the list is now a dictionary representing two images. Instead of a `size` key with the width and height, we have width1, height1, width2, height2, which is inconvenient at best.
- Given two lists of different sizes, the result truncates the longer one, silently dropping images.
- Images may have a variable number of boxes. Once the boxes are concatenated, there is no way of knowing how many belong to which image.
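The truncation and merging above can be reproduced with a toy, dependency-free stand-in for `nested_concat` (the real helper lives in `transformers`; here a tiny `T` class plays the role of a tensor and `+` plays the role of `torch.cat`):

```python
# Toy reproduction of the zip-based behaviour described above.
class T:
    """Stand-in for a tensor; `+` stands in for torch.cat along dim 0."""
    def __init__(self, data):
        self.data = list(data)
    def __add__(self, other):
        return T(self.data + other.data)
    def __eq__(self, other):
        return isinstance(other, T) and self.data == other.data
    def __repr__(self):
        return f"T({self.data})"

def toy_nested_concat(tensors, new_tensors):
    if isinstance(tensors, (list, tuple)):
        # zip pairs the i-th elements -- and silently drops the tail of the
        # longer input when the two lists differ in length
        return type(tensors)(toy_nested_concat(t, n)
                             for t, n in zip(tensors, new_tensors))
    if isinstance(tensors, dict):
        # dicts are merged key by key, fusing two images into one record
        return {k: toy_nested_concat(v, new_tensors[k]) for k, v in tensors.items()}
    return tensors + new_tensors  # leaf: "concatenate" the tensors

batch1 = [{"image_id": T([1]), "boxes": T([[0.1, 0.2, 0.3, 0.4]])}]
batch2 = [{"image_id": T([2]), "boxes": T([])},                   # no boxes
          {"image_id": T([3]), "boxes": T([[0.5, 0.5, 0.1, 0.1]])}]

merged = toy_nested_concat(batch1, batch2)
# One dict now describes two images, and batch2's second image is gone:
print(merged)  # [{'image_id': T([1, 2]), 'boxes': T([[0.1, 0.2, 0.3, 0.4]])}]
```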

This makes it impossible to define a proper `compute_metrics` function.

I would gladly submit a pull request, except that I don't even know what an appropriate fix would look like that wouldn't break something else. I suppose the ideal solution would have to include something to signal the type of label that a model works with.

I do have a simple suggestion, based on the assumption that lists of dictionaries should be concatenated as lists, NOT have their element dictionaries merged with each other.

Essentially, replace this in `nested_concat`:

```
if isinstance(tensors, (list, tuple)):
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
```

with

```
if isinstance(tensors, (list, tuple)):
    if len(tensors) and isinstance(tensors, list) and isinstance(tensors[0], Mapping):
        # a list of dicts (one per image): append across batches instead of
        # merging the i-th dicts element-wise
        return tensors + new_tensors
    else:
        return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
```
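Under that assumption, the new branch behaves like this. A standalone sketch of just the list-of-dicts path (`concat_label_lists` is a hypothetical name for illustration):

```python
from collections.abc import Mapping

def concat_label_lists(tensors, new_tensors):
    # the suggested branch: a list whose elements are dicts (one per image)
    # is extended across batches, so no dicts are merged and no images
    # are truncated
    if isinstance(tensors, list) and len(tensors) and isinstance(tensors[0], Mapping):
        return tensors + new_tensors
    raise TypeError("not a list of dicts; would fall through to the zip path")

batch1 = [{"image_id": [1]}]
batch2 = [{"image_id": [2]}, {"image_id": [3]}]
result = concat_label_lists(batch1, batch2)
print(result)  # [{'image_id': [1]}, {'image_id': [2]}, {'image_id': [3]}]
```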

Please let me know if that looks good.