Does NSP corrupt context during pre-training?

During BERT pre-training, if we use only MLM, the input is [CLS] SentenceA [SEP]. A masked token in SentenceA is then predicted from SentenceA's context alone.
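For illustration, here is roughly what that single-segment input looks like with the Hugging Face tokenizer (a minimal sketch; the example sentence is just an illustration):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# MLM-only style input: a single sentence wrapped as [CLS] SentenceA [SEP]
enc = tokenizer.encode_plus("The man went to [MASK] store")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', 'went', 'to', '[MASK]', 'store', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0] -> every position belongs to segment A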

If we use MLM + NSP, the input is [CLS] SentenceA [SEP] SentenceB [SEP]. Now a masked token in SentenceA is predicted from the context of both SentenceA and SentenceB: words in SentenceB contribute to the final hidden state of the masked token, even if the two sentences never appeared together.
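A sketch of the sentence-pair layout (same tokenizer as above): token_type_ids switches from 0 to 1 at SentenceB, but self-attention runs over the whole sequence, so SentenceB tokens can attend to the masked position in SentenceA.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# MLM + NSP style input: [CLS] SentenceA [SEP] SentenceB [SEP]
enc = tokenizer.encode_plus("The man went to [MASK] store", "He bought a gallon of milk")
for token, segment in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), enc["token_type_ids"]):
    print(token, segment)
# '[CLS]' ... 'store' '[SEP]' are segment 0; 'he' ... 'milk' '[SEP]' are segment 1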

For positive samples, SentenceA and SentenceB may share some common context since they come from the same document, but for negative samples they don't even share that.
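For reference, NSP pairs are sampled roughly like this (a simplified sketch of the scheme described in the BERT paper; documents is assumed to be a list of documents, each a list of sentences):

import random

def make_nsp_pair(documents):
    # documents: list of documents, each document a list of sentences (assumption)
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        # positive sample: SentenceB is the actual next sentence (label IsNext)
        sentence_b, label = doc[i + 1], "IsNext"
    else:
        # negative sample: SentenceB is a random sentence from the corpus (label NotNext)
        sentence_b, label = random.choice(random.choice(documents)), "NotNext"
    return sentence_a, sentence_b, label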

Does NSP corrupt context during pre-training? How does BERT deal with this corruption during pre-training?

Here is a small experiment with the Hugging Face transformers library:

import torch
from transformers import BertTokenizer, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence-pair input: [CLS] SentenceA [SEP] SentenceB [SEP]
x = tokenizer.encode_plus("The man went to [MASK] store", "He bought a gallon of milk", return_tensors="pt")

# Position of the [MASK] token in the sequence
masked_index = torch.where(x["input_ids"] == tokenizer.mask_token_id)[1].tolist()[0]

y = model(x["input_ids"], x["token_type_ids"])
for prediction in y[0][:, masked_index].topk(5).indices.tolist()[0]:
    print(tokenizer.decode([prediction]))

The output is: when, then, once, and, so
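To see how much SentenceB shifts the prediction, the same masked sentence can also be scored on its own (the MLM-only layout) and the top-5 tokens compared. A minimal sketch, reusing the model and tokenizer loaded above (top5 is just a helper name I made up):

import torch

def top5(text_a, text_b=None):
    # Encode a single sentence or a sentence pair and return the top-5 tokens for [MASK]
    enc = tokenizer.encode_plus(text_a, text_b, return_tensors="pt")
    masked_index = torch.where(enc["input_ids"] == tokenizer.mask_token_id)[1].tolist()[0]
    logits = model(**enc)[0]
    return [tokenizer.decode([i]) for i in logits[0, masked_index].topk(5).indices.tolist()]

print(top5("The man went to [MASK] store"))                                # SentenceA alone
print(top5("The man went to [MASK] store", "He bought a gallon of milk"))  # SentenceA + SentenceB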