RT-DETR attention map dimension - PekingU/rtdetr_r50vd

I'm using the model below:

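import transformers as tr  # tr is assumed to be an alias for the transformers package

pretrained_model = tr.RTDetrForObjectDetection.from_pretrained(
    pretrained_model_name_or_path="PekingU/rtdetr_r50vd",
    output_attentions=True,  # return attention maps along with predictions
)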

The decoder output attention map has shape (batch_size, num_queries, num_heads, 3, 4), although the documentation says it should be (batch_size, num_queries, num_heads, 4, 4). Nothing in the documentation explains what these last two numbers mean. I suspect that the first 4 is num_features and the second 4 is the number of offsets (sampling points) used by the Deformable Attention.
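To make that guess concrete, here is a minimal NumPy sketch of how a Deformable-Attention-style layer could end up with those trailing dimensions: a flat vector of weights per head is normalized and then reshaped into (levels, points). The sizes 3 and 4, the variable names, and the reshape itself are my assumptions for illustration. They are not taken from the library's source.

```python
import numpy as np

# Assumed sizes for illustration only.
batch_size, num_queries, num_heads = 1, 300, 8
num_levels, num_points = 3, 4  # guessed meaning of the trailing (3, 4)

# Stand-in for the raw scores a projection layer would produce:
# one score per (level, point) pair for every query and head.
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch_size, num_queries, num_heads, num_levels * num_points))

# Softmax over all level/point pairs of a head (numerically stable form).
logits -= logits.max(axis=-1, keepdims=True)
flat_weights = np.exp(logits)
flat_weights /= flat_weights.sum(axis=-1, keepdims=True)

# Split the flat axis into separate (levels, points) axes.
attention_weights = flat_weights.reshape(
    batch_size, num_queries, num_heads, num_levels, num_points
)
print(attention_weights.shape)  # (1, 300, 8, 3, 4)
```

If this reading is right, the second-to-last dimension would follow the number of feature levels rather than being fixed at 4, which would explain the 3 I am seeing.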

When I change num_features to 4 in the configuration, I get the error below during inference:

TypeError: conv2d() received an invalid combination of arguments - got (list, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:

  • (Tensor input, Tensor weight, Tensor bias = None, tuple of ints stride = 1, tuple of ints padding = 0, tuple of ints dilation = 1, int groups = 1)
    didn't match because some of the arguments have invalid types: (list of [Tensor, Tensor, Tensor], Parameter, NoneType, tuple of (int, int), tuple of (int, int), tuple of (int, int), int)
  • (Tensor input, Tensor weight, Tensor bias = None, tuple of ints stride = 1, str padding = "valid", tuple of ints dilation = 1, int groups = 1)
    didn't match because some of the arguments have invalid types: (list of [Tensor, Tensor, Tensor], Parameter, NoneType, tuple of (int, int), tuple of (int, int), tuple of (int, int), int)

So my questions are: what do the first and second 4 in the documented output shape represent? Why am I getting (3, 4) instead? And how can I fix it?
