How to create encoder masks and decoder causal masks for batch size > 1 in Transformers

Hi all,

I have been reading Lewis's NLP with Transformers book, and the "Transformer Anatomy" chapter helped me a lot with multi-headed attention and the causal mask in the decoder. Following its examples, I implemented speech recognition transformer training code with a batch size of one. A batch size of one lets me create the decoder causal mask just for the length of the single target_seq. The code works and trains, but it is too slow, so I need to extend it to create and apply correct masks for both the encoder and the decoder with batch size > 1. I tried to follow the modelling code in the HF transformers library, but it is too complicated, with memory and caching etc., and I quickly lose track of the code flow.

Could you point me to a simple example that shows how to create the encoder and decoder masks (including the causal mask) for variable-length target_seqs, how the masks are applied in the encoder's and decoder's forward() methods, and a full basic training loop, i.e. the loss computed over all the variable-length target_seqs in the decoder?

Kind regards,
Vivek
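PS: to make the question more concrete, here is a rough sketch of how I imagine the batched masks would be built (plain PyTorch, using boolean masks where True means "ignore this position"; make_padding_mask, make_causal_mask and PAD_ID are just placeholder names I made up):

```python
import torch

def make_padding_mask(lengths, max_len):
    """True at padded positions so attention can ignore them.
    lengths: (batch,) tensor of valid lengths per sequence."""
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) >= lengths.unsqueeze(1)        # (batch, max_len)

def make_causal_mask(tgt_len, device=None):
    """True above the diagonal: position i must not attend to j > i."""
    return torch.triu(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=device),
        diagonal=1,
    )                                                             # (tgt_len, tgt_len)

# Example: batch of 3 utterances with variable-length targets
src_lengths = torch.tensor([80, 120, 100])   # encoder frames per utterance
tgt_lengths = torch.tensor([7, 12, 9])       # target tokens per utterance
src_max, tgt_max = int(src_lengths.max()), int(tgt_lengths.max())

src_key_padding_mask = make_padding_mask(src_lengths, src_max)   # (3, 120)
tgt_key_padding_mask = make_padding_mask(tgt_lengths, tgt_max)   # (3, 12)
tgt_causal_mask = make_causal_mask(tgt_max)                      # (12, 12)

# With torch.nn.Transformer these would be passed as:
#   out = model(src, tgt,
#               tgt_mask=tgt_causal_mask,
#               src_key_padding_mask=src_key_padding_mask,
#               tgt_key_padding_mask=tgt_key_padding_mask,
#               memory_key_padding_mask=src_key_padding_mask)
#
# and the loss over variable-length targets would skip the padded tokens:
#   criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_ID)  # PAD_ID = padding token id
```

Is something along these lines the right direction for batch size > 1?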