I am implementing a Vision Transformer (ViT) as part of a school project, and I am required to plot an attention map to compare a CNN model against the ViT model, but I am not sure how to go about doing it.
For reference, I have been following this notebook for the code, except that I used google/vit-base-patch16-224-in21k for the ViT model:
https://mpolinowski.github.io/docs/IoT-and-Machine-Learning/ML/2023-08-03-tensorflow-i-know-flowers-deit/2023-08-03/#deit-model
This is the output from vit_model.summary():
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 224, 224, 3)] 0
sequential (Sequential) (None, 3, 224, 224) 0
vit (TFViTMainLayer) TFBaseModelOutputWithPooling( 29686272
                       last_hidden_state=(None, 197, 768),
                       pooler_output=(None, 768),
                       hidden_states=None,
                       attentions=None)
tf.__operators__.getitem (SlicingOpLambda) (None, 768) 0
dense (Dense) (None, 2) 1538
=================================================================
Total params: 29687810 (113.25 MB)
Trainable params: 29687810 (113.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
These are the configurations for the model:
ViTConfig {
"_name_or_path": "google/vit-base-patch16-224-in21k",
"attention_probs_dropout_prob": 0.0,
"encoder_stride": 16,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,
"hidden_size": 768,
"image_size": 224,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"model_type": "vit",
"num_attention_heads": 8,
"num_channels": 3,
"num_hidden_layers": 4,
"patch_size": 16,
"qkv_bias": true,
"transformers_version": "4.38.2"
}
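If I understand the config correctly, the 197 in last_hidden_state=(None, 197, 768) comes from the patch grid: with image_size 224 and patch_size 16 there are 14 × 14 = 196 patches, plus one [CLS] token (this arithmetic is my own sanity check, not from the notebook):

```python
# Derive the ViT token count from the config values above.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size   # 14 patches along each side
num_patches = patches_per_side ** 2           # 196 patches in the grid
num_tokens = num_patches + 1                  # +1 for the [CLS] token

print(patches_per_side, num_patches, num_tokens)
```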
I tried to extract the activation layer and the output layer from the original model like so, but I am unsure how to reshape the NumPy arrays so that the weights match a 224x224 image:
from tensorflow.keras.models import Model

activation_layer = vit_model.get_layer("vit")
new_model = Model(inputs=vit_model.input, outputs=activation_layer.output)

final_dense = vit_model.get_layer("dense")
W = final_dense.get_weights()[0]  # kernel of the final Dense layer, shape (768, 2)
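From reading the Hugging Face docs, I think the reshaping step would go something like the sketch below: take one layer's attention tensor of shape (batch, num_heads, 197, 197), average over heads, keep the [CLS] token's attention to the 196 patches, reshape to the 14x14 patch grid, and upsample to 224x224. I am using a dummy random tensor here since I am not sure how to get the real attentions out of the TFViTMainLayer (I believe it involves output_attentions=True, but that is an assumption):

```python
import numpy as np

# Dummy stand-in for one layer's attentions: 8 heads, 197 tokens
# (1 CLS token + 196 image patches). With a real ViT this would come
# from the model's returned attentions instead.
rng = np.random.default_rng(0)
attn = rng.random((1, 8, 197, 197))

head_avg = attn[0].mean(axis=0)       # (197, 197): average over the 8 heads
cls_to_patches = head_avg[0, 1:]      # (196,): CLS attention to each image patch

patch_grid = cls_to_patches.reshape(14, 14)        # 224 / 16 = 14 patches per side
heatmap = np.kron(patch_grid, np.ones((16, 16)))   # nearest-neighbour upsample to 224x224

# Normalise to [0, 1] so it can be overlaid, e.g. plt.imshow(heatmap, alpha=0.5)
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
print(heatmap.shape)
```

Does this reshaping make sense, and how do I get the actual attention tensors out of the "vit" layer instead of a dummy array?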
Any help would be appreciated!