How to create encoder masks and decoder causal masks for batch size > 1 in Transformers

Hi all,

I have been reading Lewis's NLP with Transformers book, and the "Transformer Anatomy" chapter helped me a lot with multi-headed attention and the causal mask in the decoder. Following its examples, I implemented speech recognition transformer training code with a batch size of one. A batch size of one lets me create the decoder causal mask just for the length of the single target_seq. The code works and trains, but it is too slow, so I need to extend it to create and apply correct masks for both the encoder and the decoder with batch size > 1. I tried to follow the modelling code in the HF transformers library, but it is too complicated, with memory and caching etc., and I quickly lose track of the code flow.

Could you point me to a simple example that shows how to create the encoder and decoder masks (including the causal mask) for variable-length target_seqs, how the masks are applied in the encoder's and decoder's forward() methods, and a full basic training loop, i.e. the loss computed over all the variable-length target_seqs in the decoder?

Kind regards,
Vivek
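PS: to make the question more concrete, here is a rough sketch of how I imagine the batched masks would be built (plain PyTorch, using boolean masks where True means "ignore this position"; make_padding_mask, make_causal_mask and PAD_ID are just placeholder names I made up):

```python
import torch

def make_padding_mask(lengths, max_len):
    """True at padded positions so attention can ignore them.
    lengths: (batch,) tensor of valid lengths per sequence."""
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) >= lengths.unsqueeze(1)        # (batch, max_len)

def make_causal_mask(tgt_len, device=None):
    """True above the diagonal: position i must not attend to j > i."""
    return torch.triu(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=device),
        diagonal=1,
    )                                                             # (tgt_len, tgt_len)

# Example: batch of 3 utterances with variable-length targets
src_lengths = torch.tensor([80, 120, 100])   # encoder frames per utterance
tgt_lengths = torch.tensor([7, 12, 9])       # target tokens per utterance
src_max, tgt_max = int(src_lengths.max()), int(tgt_lengths.max())

src_key_padding_mask = make_padding_mask(src_lengths, src_max)   # (3, 120)
tgt_key_padding_mask = make_padding_mask(tgt_lengths, tgt_max)   # (3, 12)
tgt_causal_mask = make_causal_mask(tgt_max)                      # (12, 12)

# With torch.nn.Transformer these would be passed as:
#   out = model(src, tgt,
#               tgt_mask=tgt_causal_mask,
#               src_key_padding_mask=src_key_padding_mask,
#               tgt_key_padding_mask=tgt_key_padding_mask,
#               memory_key_padding_mask=src_key_padding_mask)
#
# and the loss over variable-length targets would skip the padded tokens:
#   criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)  # PAD_ID = padding token id
```

Is something along these lines the right direction for batch size > 1?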