I am trying to pre-train a BERT model with different approaches, and I want to find out what happens when the input is a sequence of two sentences. For example, say we are using dual-task pre-training with MLM and NSP and have the following input sequence:
[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
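To make the setup concrete, here is a minimal sketch of how I build such a two-segment input, assuming the Hugging Face transformers API (bert-base-uncased is just a placeholder checkpoint for my own pre-training setup):

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

# Placeholder checkpoint; in my actual setup this is my own pre-training config
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

segment_a = "the man [MASK] to the store"
segment_b = "penguin [MASK] are flightless birds"

# The tokenizer produces [CLS] A [SEP] B [SEP] and sets token_type_ids
# to 0 for the first segment and 1 for the second.
inputs = tokenizer(segment_a, segment_b, return_tensors="pt")
print(inputs["token_type_ids"])  # segment ids
print(inputs["attention_mask"])  # padding mask (all ones here, no padding)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # MLM head: per-position scores over the vocab
print(outputs.seq_relationship_logits.shape)  # NSP head: is-next vs. not-next
```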
So when predicting the first masked token in the first part of the sequence, does the BERT model take the second part of the sequence into consideration? If it does, how? Is there a control mechanism over token_type_ids specifically for the MLM task? If there isn't, does the model ignore the segment boundaries and simply take the whole sequence into consideration?
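If it helps, here is a rough sketch of how I was planning to probe this myself, again assuming the Hugging Face transformers API (the checkpoint and the helper function are just for illustration): keep the first segment fixed, swap the second segment, and check whether the prediction for the mask in the first segment changes.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def top_prediction(segment_a: str, segment_b: str) -> str:
    """Return the top MLM prediction for the first [MASK] in segment A."""
    inputs = tokenizer(segment_a, segment_b, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    return tokenizer.convert_ids_to_tokens(logits[0, mask_pos].argmax().item())

# Same first segment, two different second segments
print(top_prediction("the man [MASK] to the store", "penguins are flightless birds"))
print(top_prediction("the man [MASK] to the store", "he bought a gallon of milk"))
```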