Does NSP corrupt context during pre-training?

During BERT pre-training, if we use only MLM, the input is [CLS] SentenceA [SEP]. A masked token in SentenceA is then predicted from SentenceA's context alone.
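For illustration, here is roughly what that single-segment input looks like with the Hugging Face tokenizer (a minimal sketch; the example sentence is just an illustration):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# MLM-only style input: a single sentence wrapped as [CLS] SentenceA [SEP]
enc = tokenizer.encode_plus("The man went to [MASK] store")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', 'went', 'to', '[MASK]', 'store', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0] -> every position belongs to segment A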

If we use MLM + NSP, the input is [CLS] SentenceA [SEP] SentenceB [SEP]. Now a masked token in SentenceA is predicted from the context of both SentenceA and SentenceB: words in SentenceB contribute to the final hidden state of the masked token, even if the two sentences never appeared together.
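A sketch of the sentence-pair layout (same tokenizer as above): token_type_ids switches from 0 to 1 at SentenceB, but self-attention runs over the whole sequence, so SentenceB tokens can attend to the masked position in SentenceA.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# MLM + NSP style input: [CLS] SentenceA [SEP] SentenceB [SEP]
enc = tokenizer.encode_plus("The man went to [MASK] store", "He bought a gallon of milk")
for token, segment in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), enc["token_type_ids"]):
    print(token, segment)
# '[CLS]' ... 'store' '[SEP]' are segment 0; 'he' ... 'milk' '[SEP]' are segment 1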

For positive samples, SentenceA and SentenceB may share some common context since they come from the same document, but for negative samples they don't even share that.
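For reference, NSP pairs are sampled roughly like this (a simplified sketch of the scheme described in the BERT paper; documents is assumed to be a list of documents, each a list of sentences):

import random

def make_nsp_pair(documents):
    # documents: list of documents, each document a list of sentences (assumption)
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        # positive sample: SentenceB is the actual next sentence (label IsNext)
        sentence_b, label = doc[i + 1], "IsNext"
    else:
        # negative sample: SentenceB is a random sentence from the corpus (label NotNext)
        sentence_b, label = random.choice(random.choice(documents)), "NotNext"
    return sentence_a, sentence_b, label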

Does NSP corrupt context during pre-training? How does BERT deal with this corruption during pre-training?

Here is a small experiment with the Hugging Face transformers library:

import torch
from transformers import BertTokenizer, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence-pair input: [CLS] SentenceA [SEP] SentenceB [SEP]
x = tokenizer.encode_plus("The man went to [MASK] store", "He bought a gallon of milk", return_tensors="pt")

# Position of the [MASK] token in the sequence
masked_index = torch.where(x["input_ids"] == tokenizer.mask_token_id)[1].tolist()[0]

y = model(x["input_ids"], x["token_type_ids"])
for prediction in y[0][:, masked_index].topk(5).indices.tolist()[0]:
    print(tokenizer.decode([prediction]))

The output is: when, then, once, and, so
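To see how much SentenceB shifts the prediction, the same masked sentence can also be scored on its own (the MLM-only layout) and the top-5 tokens compared. A minimal sketch, reusing the model and tokenizer loaded above (top5 is just a helper name I made up):

import torch

def top5(text_a, text_b=None):
    # Encode a single sentence or a sentence pair and return the top-5 tokens for [MASK]
    enc = tokenizer.encode_plus(text_a, text_b, return_tensors="pt")
    masked_index = torch.where(enc["input_ids"] == tokenizer.mask_token_id)[1].tolist()[0]
    logits = model(**enc)[0]
    return [tokenizer.decode([i]) for i in logits[0, masked_index].topk(5).indices.tolist()]

print(top5("The man went to [MASK] store"))                                # SentenceA alone
print(top5("The man went to [MASK] store", "He bought a gallon of milk"))  # SentenceA + SentenceB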