Hello everybody,
I would like to use a transformer to extract features from images at different levels, as is possible with a convolutional network, where we can take the output of the convolutional layer at level 3, 4, or 5.
Is there a way to do something like that with transformers?
Or, and this is more interesting, how can I get the features together with the positional embeddings from the transformer and use them as input to another kind of network with attention?
Sure, it can be done (for the 1st question). To extract features, use the bare model; for instance, with ViT the bare model is ViTModel. By default most models return last_hidden_state (the last layer) and pooler_output. To get all layers, set output_hidden_states=True in the forward pass. Then you can access any layer by indexing into the returned tuple.
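Here is a minimal sketch of that flow. It assumes the google/vit-base-patch16-224-in21k checkpoint and a sample COCO image; both are placeholders for your own checkpoint and data, and in older transformers versions the image processor is called ViTFeatureExtractor instead of ViTImageProcessor.

```python
import requests
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Sample image and checkpoint (swap in your own)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# last_hidden_state: (batch, seq_len, hidden), e.g. (1, 197, 768) for ViT-base
print(outputs.last_hidden_state.shape)

# hidden_states: tuple of (embedding output + one tensor per layer),
# so 13 tensors for ViT-base; index it to pick an intermediate level
print(len(outputs.hidden_states))
features_level_4 = outputs.hidden_states[4]
print(features_level_4.shape)
```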
I tried this solution, but when I pass a 3-channel image as input in a batch of dimension 1, I get the error: KeyError: ((1, 1, 224, 224), '|u1') and I can't find where the image becomes 1-channel.
Thank you for this valuable information. I just have a question: I tried it and got a tuple comprising 13 tensors, each of shape 1x12x197x768. My question is: how can I find the class token feature vector?