Self-attention extraction from LongT5

Hi there, I’m trying to extract the encoder self-attention weights from LongT5 (using the LongT5EncoderModel class).
I can get the encoder attention weights by passing output_attentions=True to the forward method, but I’m having trouble understanding the output, which is a tensor of shape (batch_size, num_layers, num_heads, 128, 3*128 + K). I think K depends on the sequence length, while 128 should be the relative_attention_max_distance.
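For concreteness, here is a minimal sketch of what I’m doing (the checkpoint name and the input text are just placeholders for my actual setup):

```python
import torch
from transformers import AutoTokenizer, LongT5EncoderModel

# placeholder checkpoint, I load my own fine-tuned model in practice
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5EncoderModel.from_pretrained("google/long-t5-tglobal-base")
model.eval()

inputs = tokenizer("a long input document ...", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds the encoder self-attention weights, one tensor per layer
for layer_idx, attn in enumerate(outputs.attentions):
    print(layer_idx, attn.shape)  # last two dims are 128 and 3*128 + K, as described above
```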
When I do this for BERT-like models, the last two dimensions form a square seq_len x seq_len matrix: each row, i.e. each token, holds the attention scores with respect to all the other tokens, which is exactly what I want (see the sketch below). How can I achieve the same with LongT5?
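For comparison, a minimal sketch of the BERT-like case I’m referring to (again, the checkpoint and the input sentence are just examples):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

attn = outputs.attentions[0]  # self-attention weights of the first layer
print(attn.shape)             # (batch_size, num_heads, seq_len, seq_len)
# attn[0, h, i, j] is how much token i attends to token j in head h
```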
Thanks in advance!