@nielsr
In the line above, the first token of the transformer's output is used to compute the classification logits. I am confused because, in the source code, the learnable classification token is not at the zeroth index. The cls_token is concatenated at the start of the image patch tokens:
visual_tokens = torch.cat([cls_token, visual_patch_embeddings], dim=1)
and to create the transformer input, the text+bbox inputs and the image tokens are concatenated as follows:
transformer_inp = torch.cat([text_embeddings, visual_tokens], dim=1)
This means the classification token sits at index 512, given that we cap the text input at 512 tokens.
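As a minimal sketch of the indexing I mean (the shapes, the 512-token text cap, and the patch count are hypothetical, just to show where the [CLS] token lands):

```python
import torch

batch, hidden = 2, 768
text_seq_len = 512     # assumed cap on text tokens
num_patches = 196      # e.g. a 14x14 patch grid; hypothetical

text_embeddings = torch.randn(batch, text_seq_len, hidden)
cls_token = torch.randn(batch, 1, hidden)  # learnable [CLS]
visual_patch_embeddings = torch.randn(batch, num_patches, hidden)

# [CLS] is prepended to the image patches, not to the text
visual_tokens = torch.cat([cls_token, visual_patch_embeddings], dim=1)
transformer_inp = torch.cat([text_embeddings, visual_tokens], dim=1)

# so [CLS] sits at index text_seq_len, not at index 0
cls_output = transformer_inp[:, text_seq_len]  # index 512 here
first_token = transformer_inp[:, 0]            # this is a text token
```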
This is just for clarification; using the first token for classification also works fine.
Thanks