Next sentence prediction on custom model

I’m trying to use a BERT-based model (jeniya/BERTOverflow · Hugging Face) to do Next Sentence Prediction. This is essentially a BERT model that has been pretrained on StackOverflow data.

Now, to pretrain it, they must obviously have used the Next Sentence Prediction task. But when I do an AutoModelForNextSentencePrediction.from_pretrained("jeniya/BERTOverflow"), I get a warning message saying:

Some weights of BertForNextSentencePrediction were not initialized from the model checkpoint at jeniya/BERTOverflow and are newly initialized: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
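
For reference, here's one way I could check whether those weights are actually in the checkpoint (this assumes the repo ships a pytorch_model.bin; the filename would need adjusting if the weights are stored differently):

```python
from huggingface_hub import hf_hub_download
import torch

# Download the raw checkpoint and look for the NSP head weights.
# Assumes the repo stores its weights as pytorch_model.bin.
path = hf_hub_download(repo_id="jeniya/BERTOverflow", filename="pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")
print([k for k in state_dict if "seq_relationship" in k])  # [] if the head is missing
```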

Now, I get that the message is telling me that the NSP head does not come with this model and so has been initialized randomly. My question is: if they have published a BERT model pre-trained on some custom data, shouldn’t they also have used an NSP head as part of their pretraining objective? If so, where did that head go? Did they just throw it away?

In that case, how would I go about getting this custom model to work for NSP? Should I pre-train the whole goddamn thing again, but this time not throw away the NSP head? Or could I simply use AutoModel, extract the [CLS] token representation, put an MLP on top of that, and train it with a few examples to do NSP? The former would be infeasible given the compute requirements, and I feel like the latter is just wrong. Am I missing something?
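
For concreteness, this is roughly what I mean by the second option (just a sketch; NSPHead is my own wrapper, not anything from transformers):

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class NSPHead(nn.Module):
    """Frozen BERTOverflow encoder + small trainable classifier on the [CLS] vector."""
    def __init__(self, name="jeniya/BERTOverflow"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # use the encoder purely as a feature extractor
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 2)
        )

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.classifier(cls)  # logits: is-next vs. not-next

tokenizer = AutoTokenizer.from_pretrained("jeniya/BERTOverflow")
model = NSPHead()
enc = tokenizer("first sentence", "candidate next sentence", return_tensors="pt")
logits = model(**enc)  # shape (1, 2)
```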

Any help would be greatly appreciated! Thank you!


Hey there @msamogh

I am facing a similar problem to yours: have you found anything out since you created this thread?
Also, if you happen to know: does this mean that models with the “BertForMaskedLM” architecture have been trained ONLY on MLM, and not on NSP, so I would have to do that training myself?

Why? Follow-up papers have shown that NSP does not contribute much, if anything at all (RoBERTa dropped it completely; ALBERT uses sentence order prediction instead). The authors were likely aware of these findings and did not feel the need to include the task. Note, by the way, that the BERT weights also do not include the NSP weights.

To make this work, you’d have to finetune on this task specifically. You can use the pretrained model as a frozen feature extractor and add a classification head on top, or finetune end-to-end, or unfreeze gradually. If you have plenty of data and compute, you can train from scratch, but as you note that may not be feasible.
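
A rough sketch of the end-to-end option, just to make it concrete (toy sentence pairs; you'd obviously use a proper dataset, dataloader, and many more steps):

```python
import torch
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction

name = "jeniya/BERTOverflow"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForNextSentencePrediction.from_pretrained(name)  # NSP head starts randomly initialised
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch: label 0 = sentence B really follows A, label 1 = B is a random sentence
enc = tokenizer(
    ["How do I sort a list in Python?", "How do I sort a list in Python?"],
    ["Use the built-in sorted() function.", "The weather is nice today."],
    padding=True, return_tensors="pt",
)
labels = torch.tensor([0, 1])

model.train()
loss = model(**enc, labels=labels).loss  # NSP head and encoder are trained together
loss.backward()
optimizer.step()
```

For the frozen variant, you'd load AutoModel instead and train only a small head on the [CLS] output (as described above); gradual unfreezing just means flipping requires_grad back on for more encoder layers as training progresses.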