Next sentence prediction on custom model

I’m trying to use a BERT-based model (jeniya/BERTOverflow on the Hugging Face Hub) to do Next Sentence Prediction (NSP). This is essentially a BERT model that has been pretrained on StackOverflow data.

Now, to pretrain it, they presumably used the Next Sentence Prediction objective. But when I call AutoModelForNextSentencePrediction.from_pretrained("jeniya/BERTOverflow"), I get a warning message saying:

Some weights of BertForNextSentencePrediction were not initialized from the model checkpoint at jeniya/BERTOverflow and are newly initialized: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Now, I get that the message is telling me that the NSP head does not come with this model and so has been initialized randomly. My question is: if they pretrained a BERT model on custom data and published it, shouldn’t an NSP head have been part of their pretraining objective? If so, where did that head go? Did they just throw it away?

If so, how would I go about getting this custom model to work for the task of NSP? Should I pre-train the whole goddamn thing again, but this time not throw away the NSP head? Or can I simply use AutoModel, extract the [CLS] token representation, put an MLP on top of that, and train it on a few examples to do NSP? The former would be infeasible given the compute requirements. I feel like the latter is just wrong. Am I missing something?
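To be concrete, the second option I have in mind would look something like the sketch below. The class name and head size are my own; the head mirrors BERT's pooler (linear + tanh) followed by a two-way classifier, and the encoder is frozen so only the head trains:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class NSPClassifier(nn.Module):
    """Sketch: frozen pretrained encoder + small MLP over the [CLS] token."""

    def __init__(self, encoder, freeze=True):
        super().__init__()
        self.encoder = encoder
        if freeze:
            for p in self.encoder.parameters():
                p.requires_grad = False  # use the encoder as a feature extractor
        hidden = self.encoder.config.hidden_size
        # Two classes: "B follows A" vs. "B is a random sentence"
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 2),
        )

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        cls = out.last_hidden_state[:, 0]  # representation of the [CLS] token
        return self.head(cls)


# Usage (downloads the checkpoint):
# model = NSPClassifier(AutoModel.from_pretrained("jeniya/BERTOverflow"))
```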

Any help would be greatly appreciated! Thank you!


Hey there @msamogh

I am facing a similar problem: have you discovered anything since you created this thread?
Also, if you know, does this mean that models with the architecture “BertForMaskedLM” have been trained ONLY on MLM, and not on NSP, so that I would have to do that part again?

Why? Follow-up papers have shown that NSP does not contribute much, if anything at all. (RoBERTa dropped it completely; ALBERT uses sentence order prediction instead.) The authors were likely aware of these findings and did not feel the need to include this task. Note, by the way, that the BERT weights they released also do not include the NSP weights.

To make this work, you’d have to finetune on this task specifically. You can use the pretrained model as a frozen feature extractor and add a classification head on top, finetune end-to-end, or unfreeze gradually. If you have plenty of data and compute, you can train from scratch, but as you note that may not be feasible.
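As a rough sketch (not the model authors’ code, just one way to wire it up), a single end-to-end fine-tuning step with BertForNextSentencePrediction could look like this. The freshly initialized NSP head from the warning message is exactly what gets trained here; data pairs and hyperparameters are placeholders:

```python
import torch
from transformers import BertForNextSentencePrediction


def nsp_finetune_step(model, batch, labels, optimizer):
    """One fine-tuning step on a batch of sentence pairs.

    labels: 0 = sentence B really follows sentence A, 1 = random pair
    (this is the label convention transformers uses for NSP).
    """
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()


# Usage (downloads the checkpoint; expect the warning from this thread,
# since the NSP head starts out randomly initialized):
# from transformers import AutoTokenizer
# model = BertForNextSentencePrediction.from_pretrained("jeniya/BERTOverflow")
# tok = AutoTokenizer.from_pretrained("jeniya/BERTOverflow")
# batch = tok("first sentence", "candidate next sentence", return_tensors="pt")
# opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
# loss = nsp_finetune_step(model, batch, torch.tensor([0]), opt)
```

To only train the head (the frozen feature-extractor option), freeze the encoder parameters before building the optimizer, or pass only the head’s parameters to it.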