Using a transformer to get image features

Hello everybody,
I would like to use a transformer to extract features from images at different levels, the way it is possible with a convolutional network, where we can take the output of the convolutional layer at level 3, 4, or 5.
Is it possible to do something like that with transformers?

Or, and this is more interesting, how can I get the features together with the positional embedding from the transformer and use them as input to another kind of network with attention?

Sure, it can be done (for the first question). To extract features, use the bare model; for ViT, for instance, the bare model is named ViTModel. By default, most models return last_hidden_state (the last layer) and pooler_output. To get all layers, set output_hidden_states=True in the forward pass (line 10). The output then carries the hidden states of every layer in outputs.hidden_states, and you can pick the one you want by index.

Consider this code

1. from transformers import ViTFeatureExtractor, ViTModel
2. import torch
3. from datasets import load_dataset

4. dataset = load_dataset("huggingface/cats-image")
5. image = dataset["test"]["image"][0]

6. feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
7. model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

8. inputs = feature_extractor(image, return_tensors="pt")

9. with torch.no_grad():
10.     outputs = model(**inputs, output_hidden_states=True)
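To make the indexing concrete, here is a minimal sketch of reading individual layers from hidden_states. It assumes a small, randomly initialized ViT built from ViTConfig (the sizes are made up so that nothing has to be downloaded) and a random tensor in place of a real image; with the pretrained checkpoint the indexing should work the same way, only the hidden size and number of layers differ.

```python
import torch
from transformers import ViTConfig, ViTModel

# Assumption: a tiny, randomly initialized ViT (illustrative sizes, no pretrained weights).
config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=64,
    image_size=224,
    patch_size=16,
)
model = ViTModel(config).eval()

# Stand-in for a preprocessed image batch: (batch, channels, height, width).
pixel_values = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_hidden_states=True)

# Tuple with num_hidden_layers + 1 entries, each of shape (batch, tokens, hidden_size).
hidden_states = outputs.hidden_states

# Index 0 is the embedding output: patch embeddings with position embeddings added.
embeddings = hidden_states[0]

# Index i (i >= 1) is the output of the i-th encoder layer.
layer_3 = hidden_states[3]

# Drop the [CLS] token and fold the patch tokens back into a 2D grid,
# which gives a CNN-like feature map of shape (batch, hidden_size, 14, 14).
patch_tokens = layer_3[:, 1:, :]
grid = config.image_size // config.patch_size
feature_map = patch_tokens.transpose(1, 2).reshape(1, config.hidden_size, grid, grid)

print(len(hidden_states), embeddings.shape, feature_map.shape)
```

Regarding the second question, hidden_states[0] may be the closest match, since it is the token sequence after the position embeddings are added and before any encoder layer, so it could be fed to another attention-based network.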

I tried this solution, but when I pass a 3-channel image as input in a batch of size 1, I get the error: KeyError: ((1, 1, 224, 224), '|u1'), and I can't find where the image becomes 1-channel.
