Hi @dilawn, have you figured this out yet, by any chance? I have just posted a similar question about not finding causal masks in the RobertaForCausalLM and BertGeneration classes. I am wondering whether I misunderstand the concept or am just missing the line of code that applies these causal masks.