How to plot an attention map for a Vision Transformer model

I am implementing a Vision Transformer (ViT) model as part of a school project, and I am required to plot an attention map to compare a CNN model and a ViT model, but I am not sure how to go about it.

I have been following this notebook for the code, except that I used google/vit-base-patch16-224-in21k for the ViT model:
https://mpolinowski.github.io/docs/IoT-and-Machine-Learning/ML/2023-08-03-tensorflow-i-know-flowers-deit/2023-08-03/#deit-model

This is the output from vit_model.summary():

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 sequential (Sequential)     (None, 3, 224, 224)       0         
                                                                 
 vit (TFViTMainLayer)        TFBaseModelOutputWithPo   29686272  
                             oling(last_hidden_state             
                             =(None, 197, 768),                  
                              pooler_output=(None, 7             
                             68),                                
                              hidden_states=None, at             
                             tentions=None)                      
                                                                 
 tf.__operators__.getitem (  (None, 768)               0         
 SlicingOpLambda)                                                
                                                                 
 dense (Dense)               (None, 2)                 1538      
                                                                 
=================================================================
Total params: 29687810 (113.25 MB)
Trainable params: 29687810 (113.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
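
I also noticed that the summary shows attentions=None for the vit layer. From what I understand, the underlying Hugging Face layer can return the attention weights if it is called with output_attentions=True, so I was thinking of something along these lines (images is just a placeholder for a preprocessed batch, and I have not verified that this call is the right way to do it):

import tensorflow as tf

vit_layer = vit_model.get_layer("vit")

# images: placeholder for a preprocessed batch of shape (batch, 224, 224, 3);
# the sequential layer in the summary transposes to channels-first, so I do the same here
pixel_values = tf.transpose(images, perm=[0, 3, 1, 2])

# Ask the Hugging Face layer for the per-layer attention tensors
outputs = vit_layer(pixel_values, output_attentions=True)
attentions = outputs.attentions  # tuple with one (batch, num_heads, 197, 197) tensor per layer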

This is the configuration of the model:

ViTConfig {
  "_name_or_path": "google/vit-base-patch16-224-in21k",
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 224,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 8,
  "num_channels": 3,
  "num_hidden_layers": 4,
  "patch_size": 16,
  "qkv_bias": true,
  "transformers_version": "4.38.2"
}
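
Based on the config, I think the 197 in the last_hidden_state shape is the 14x14 = 196 patch tokens plus one [CLS] token:

image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2          # 14 * 14 = 196
seq_len = num_patches + 1                    # + [CLS] token = 197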

I tried to extract the activation layer and the final dense weights from the original model like so, but I am unsure how to reshape the resulting NumPy arrays so that they line up with the 224x224 input image:

from tensorflow.keras.models import Model

# Model that outputs the ViT layer's activations instead of the class logits
activation_layer = vit_model.get_layer("vit")
new_model = Model(inputs=vit_model.input, outputs=activation_layer.output)

# Weights of the final classification head, shape (768, 2)
final_dense = vit_model.get_layer("dense")
W = final_dense.get_weights()[0]
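
What I was planning to do after that is something like the class-activation-map approach from the CNN version, but I am not sure whether dropping the [CLS] token, reshaping the remaining 196 patch tokens into a 14x14 grid, and resizing that up to 224x224 is the correct way to get a map that lines up with the input image. This is the rough sketch I had in mind (predicted_class is a placeholder for the class index I want to visualise, and vit_layer / pixel_values come from the sketch above):

import tensorflow as tf

# I was not sure what structure new_model returns,
# so here I call the Hugging Face layer directly
outputs = vit_layer(pixel_values)
last_hidden_state = outputs.last_hidden_state.numpy()  # (1, 197, 768)

patch_tokens = last_hidden_state[0, 1:, :]              # drop the [CLS] token -> (196, 768)
cam = patch_tokens @ W[:, predicted_class]              # weight by the dense weights -> (196,)
cam = cam.reshape(14, 14)                               # map back onto the 14x14 patch grid
cam = tf.image.resize(cam[None, :, :, None], (224, 224))  # upsample to the 224x224 image
cam = cam[0, :, :, 0].numpy()                           # heatmap to overlay on the input image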

Any help would be appreciated!