I am reading the code of chapter 10 of your book. However, what I am missing is an understanding of where exactly the masking step happens. To my understanding, at some point in the code certain tokens should be masked out and predicted by the model. Does this happen implicitly when we import AutoModelForCausalLM?
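To make clear what I mean, here is a minimal sketch of my current understanding (not the book's code, and the token ids are made up): in causal-LM training there is no explicit [MASK] token; instead the labels are the inputs shifted by one position, so every token is "masked" in the sense that the model must predict it from the tokens before it.

```python
# Hypothetical token ids, just for illustration.
token_ids = [5, 17, 42, 9]

# Next-token prediction via a one-position shift:
inputs = token_ids[:-1]   # [5, 17, 42]
labels = token_ids[1:]    # [17, 42, 9]

# My understanding is that Hugging Face *ForCausalLM models do this
# shift internally inside forward() when you pass labels=input_ids,
# which is why no explicit masking step appears in the training loop.
for x, y in zip(inputs, labels):
    print(f"predict {y} given context ending in {x}")
```

Is that the mechanism that is at work here, or does the masking happen somewhere else?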
Also, I would like to know whether the model uses only the unmasked tokens that belong to the group of tokens appearing between two EOS tokens. That is, how does the model know that it shouldn't use the tokens of the previous sentence to predict the masked token of the next sentence? In this case the previous sentence is a piece of code that doesn't contain any information about the next one and shouldn't be used for prediction.
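To illustrate what I mean by "not attending across EOS boundaries", here is a small sketch of the kind of mask I would expect, on top of the usual causal constraint (EOS_ID and the token ids are hypothetical):

```python
EOS_ID = 50256                       # hypothetical EOS token id
ids = [3, 7, EOS_ID, 11, 2]          # two segments separated by EOS

# Assign each position a segment id that increments after every EOS.
seg, s = [], 0
for t in ids:
    seg.append(t)
    seg[-1] = s
    if t == EOS_ID:
        s += 1

n = len(ids)
# allowed[i][j] is True if position i may attend to position j:
# j must not be in the future (causal) and must be in the same segment.
allowed = [[j <= i and seg[i] == seg[j] for j in range(n)] for i in range(n)]
print(allowed)
```

Is something like this block-diagonal mask applied in the chapter 10 code, or does the model attend across EOS tokens and simply learn to ignore the earlier segment?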