Dynamic attention mask during GPT-2 training

My task is to generate a list of options and a story, given an intro.

An option is a sentence starting with the special token ‘<|option|>’.
A training instance looks like “intro <|endofintro|> <|option|> option1 <|option|> option2 <|endofoption|> story <|endofstory|>”.

I need a dynamic attention mask because:
(1) I don’t want later options to attend to earlier options, so while an option is being generated all previous options should be masked. For example, I could set the attention mask to 0 at the positions of all previous options.
(2) However, the story should attend to all the options, so for the story the attention mask at all option positions has to stay 1.
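
To make the two rules concrete, here is a minimal sketch of the mask I have in mind, written as a square matrix (row i lists which positions token i may attend to) instead of one value per position. It is plain Python on a toy token list. The helper names `segment_ids` and `build_mask` are made up for illustration, and it rests on a few assumptions of mine: the special tokens are spelled as below, every option may still see the intro, and ‘<|endofoption|>’ is treated as the first story position.

```python
OPTION = "<|option|>"
END_OPTIONS = "<|endofoption|>"  # assumed spelling of the end-of-options token
INTRO, STORY = 0, -1             # segment labels; options are numbered 1..n


def segment_ids(tokens):
    """Label each token: 0 = intro, k = k-th option, -1 = story (hypothetical helper)."""
    ids, current, n_options = [], INTRO, 0
    for tok in tokens:
        if tok == OPTION:
            n_options += 1
            current = n_options      # a new option starts at its <|option|> token
        elif tok == END_OPTIONS:
            current = STORY          # assumption: the story segment starts here
        ids.append(current)
    return ids


def build_mask(tokens):
    """mask[i][j] == 1 means position i may attend to position j (hypothetical helper)."""
    seg = segment_ids(tokens)
    n = len(tokens)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):               # causal: never look at later positions
            sees_intro = seg[j] == INTRO     # assumption: everything may see the intro
            same_segment = seg[i] == seg[j]  # rule (1): an option sees only itself
            from_story = seg[i] == STORY     # rule (2): the story sees all options
            if sees_intro or same_segment or from_story:
                mask[i][j] = 1
    return mask


tokens = ["intro", "<|endofintro|>", OPTION, "option1", OPTION, "option2",
          END_OPTIONS, "story", "<|endofstory|>"]
mask = build_mask(tokens)
for tok, row in zip(tokens, mask):
    print(f"{tok:>16} {row}")
```

In this toy example the row for “option2” has zeros at the positions of the first option, while the row for “story” has ones at every earlier position. Whether a given GPT-2 implementation accepts a per-position mask of this shape during training is something I have not checked.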

During generation this is easy to implement: I generate one option at a time and, once all the options are generated, modify the attention mask before generating the story.

But I don’t know how to do the same thing during training, where the whole sequence is fed to the model at once.