I’m using this code snippet from the docs of the HuggingFace ViT image-classification model, with one addition: I pass the output_attentions=True parameter. Nevertheless, no attentions are returned.
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224', output_attentions=True)
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# --> this should print the attentions
print(outputs.attentions)
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
The output of print(outputs.attentions) is:
(None, None, None, None, None, None, None, None, None, None, None, None)
What am I doing wrong, and how can I get the attention values?
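For reference, here is what I expect, plus a variant I have been considering (a minimal sketch; I am assuming the forward call also accepts output_attentions=True, as it does for other transformers models, but I have not confirmed this for ViT):

# Hypothetical variant: request attentions at call time instead of
# setting the flag on the config via from_pretrained.
outputs = model(**inputs, output_attentions=True)

# Expectation: one attention tensor per transformer layer.
# For google/vit-base-patch16-224 that should be 12 tensors of shape
# (batch_size=1, num_heads=12, seq_len=197, seq_len=197),
# where 197 = 196 image patches + 1 [CLS] token.
for i, attn in enumerate(outputs.attentions):
    print(i, None if attn is None else attn.shape)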