Self-attention masking for T5 encoder?

I’m interested in applying a custom self-attention mask in the T5 encoder: a tensor of shape (batch, max_input_len, max_input_len) holding, for each example in the batch, a binary matrix whose entry (i, j) specifies whether input token i is allowed to attend to token j.
This idea is explored for a different Transformer architecture in the paper “Improving Compositional Generalization in Classification Tasks via Structure Annotations”. That implementation is in TensorFlow; the figure below is from the paper:

In contrast, I’d like to use the PyTorch T5 implementation from the Transformers library if possible. The option seems to be available for BERT:

But for T5 I haven’t been able to find an equivalent; the encoder_attention_mask there has shape (batch_size, encoder_seq_len):

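For concreteness, here’s a sketch of the kind of mask I have in mind, in plain PyTorch (the helper name and the pair-list input format are just my own, for illustration):

```python
import torch

def build_pairwise_mask(allowed_pairs, batch_size, seq_len):
    """Build a (batch, seq_len, seq_len) binary self-attention mask.

    allowed_pairs: one list of (i, j) index pairs per batch element;
    each pair is applied symmetrically (i may attend to j and vice versa).
    The diagonal is always set so every token can attend to itself.
    """
    mask = torch.zeros(batch_size, seq_len, seq_len, dtype=torch.long)
    idx = torch.arange(seq_len)
    mask[:, idx, idx] = 1  # self-attention always allowed
    for b, pairs in enumerate(allowed_pairs):
        for i, j in pairs:
            mask[b, i, j] = 1
            mask[b, j, i] = 1
    return mask

# Example: batch of 1, 4 tokens; only tokens 0<->1 and 2<->3 may attend
# to each other (plus the diagonal).
mask = build_pairwise_mask([[(0, 1), (2, 3)]], batch_size=1, seq_len=4)
print(mask[0])
```

From a quick read of Transformers’ `ModuleUtilsMixin.get_extended_attention_mask`, a 3D mask of shape (batch, seq_len, seq_len) appears to be expanded to (batch, 1, seq_len, seq_len) before being added to the attention scores, so maybe passing a tensor like this directly as `attention_mask` to `T5EncoderModel` would just work — but I haven’t verified that code path for T5 specifically, which is why I’m asking.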
Anyone have experience with this?