BERT Next Sentence Prediction: How to do predictions?

Let’s say I have a pretrained BERT model (pretrained using NSP and MLM tasks as usual) on a large custom dataset.

The way I understand NSP to work is that you take the embedding corresponding to the [CLS] token from the final layer and pass it to a linear layer that reduces it to 2 dimensions. Then you apply a softmax on top of it to get a prediction of whether the two sentences are consecutive or not.
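To make the description above concrete, here is a minimal sketch of such a head in plain PyTorch. The class name `NSPHeadSketch` and the toy sizes are made up for illustration. One detail worth noting: in the Hugging Face BERT implementation the [CLS] vector first goes through a "pooler" (a dense layer followed by tanh) before the 2-way linear layer, so the sketch includes that step.

```python
import torch
import torch.nn as nn


class NSPHeadSketch(nn.Module):
    """Illustrative NSP head: pooler (dense + tanh) followed by a 2-way linear layer."""

    def __init__(self, hidden_size):
        super().__init__()
        self.pooler = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, last_hidden_state):
        # Position 0 of every sequence holds the [CLS] token.
        cls_vector = last_hidden_state[:, 0]           # (batch, hidden)
        pooled = torch.tanh(self.pooler(cls_vector))   # (batch, hidden)
        return self.classifier(pooled)                 # (batch, 2)


torch.manual_seed(0)
# Random stand-in for the encoder's final-layer output: (batch, seq_len, hidden).
hidden_states = torch.randn(3, 10, 16)
head = NSPHeadSketch(hidden_size=16)
logits = head(hidden_states)
probs = torch.softmax(logits, dim=-1)  # each row sums to 1
```

The softmax is only needed to read the output as probabilities; for training, the raw logits would normally go straight into a cross-entropy loss.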

Now, the published pretrained model I have does not include this “NSP head”, so I have to train one myself. How do I do this? Since the only parameters I presume I’ll need to tweak are the ones of the linear layer, will a small dataset be enough for this?
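One possible way to train only the head, sketched under a few assumptions: the model is loaded as `BertForNextSentencePrediction`, the encoder lives under `model.bert` and the head under `model.cls`, and a tiny randomly initialised config stands in for the real checkpoint so that nothing has to be downloaded. With a real checkpoint you would use `from_pretrained` instead; as far as I know, a head that is missing from the checkpoint is then freshly initialised with a warning.

```python
import torch
from transformers import BertConfig, BertForNextSentencePrediction

# Tiny random model as a stand-in for the real pretrained checkpoint (assumption).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64,
                    max_position_embeddings=64)
model = BertForNextSentencePrediction(config)

# Freeze the whole encoder (this includes the pooler); only the NSP head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# Toy batch: 4 pairs of 12 tokens, first half sentence A, second half sentence B.
input_ids = torch.randint(0, config.vocab_size, (4, 12))
token_type_ids = torch.cat(
    [torch.zeros(4, 6, dtype=torch.long), torch.ones(4, 6, dtype=torch.long)], dim=1
)
labels = torch.tensor([0, 1, 0, 1])

encoder_before = model.bert.embeddings.word_embeddings.weight.clone()
head_before = model.cls.seq_relationship.weight.clone()

# One training step: only the head's weight and bias receive gradients.
loss = model(input_ids=input_ids, token_type_ids=token_type_ids, labels=labels).loss
loss.backward()
optimizer.step()
```

With the encoder frozen, the trainable part is just a `hidden_size x 2` weight matrix plus a bias. Whether a small dataset is enough still depends on the data, so this only shows the mechanics.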

Why is the convention to throw away this NSP head? Isn’t it a useful thing to publish for others to use?

Have you found a solution for this?

Hi @nicir.

Next Sentence Prediction (NSP) just trains the model on the relation between two sentences.

What is your training task?
Are you fine-tuning a pretrained model on some next sentence prediction task, or was the model trained with the MLM loss only, without the NSP loss?
Either way, this class from the HF documentation will be helpful to you: BertForNextSentencePrediction

In that code, the class contains a BertOnlyNSPHead.
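A short sketch of what that looks like in practice, again with a tiny random config as a stand-in so it runs offline (the sizes and random inputs are assumptions for illustration). It inspects the head and runs a forward pass. Per the documentation of this class, label 0 means "sentence B is a continuation of sentence A" and label 1 means "sentence B is a random sentence", so index 0 of the softmax is the "is next" probability.

```python
import torch
from transformers import BertConfig, BertForNextSentencePrediction

# Tiny random model, only to show the structure and output shapes (assumption).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64,
                    max_position_embeddings=64)
model = BertForNextSentencePrediction(config).eval()

# The NSP head sits under `model.cls`; its only layer is a linear map to 2 classes.
head = model.cls
print(type(head).__name__)
print(head.seq_relationship)

# Two toy sentence pairs of 10 tokens each: 5 tokens for A, 5 for B.
input_ids = torch.randint(0, config.vocab_size, (2, 10))
token_type_ids = torch.tensor([[0] * 5 + [1] * 5] * 2)

with torch.no_grad():
    logits = model(input_ids=input_ids, token_type_ids=token_type_ids).logits

probs = torch.softmax(logits, dim=-1)
is_next_prob = probs[:, 0]  # probability that B follows A (label 0)
```

If your own labels use 1 for "consecutive", they would need to be flipped to match this convention.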

Hope this helps.


Hi @cog,

I’ve got a custom dataset of sentences and their following sentences with labels of 0 and 1, looking like this:

features: [‘input_ids’, ‘token_type_ids’, ‘labels’],
num_rows: 61

where ‘input_ids’ are the two sentences separated by a [SEP] token like this:

[102, 196, 10094, 30925, 232, 23180, 1555, 853, 16778, 2517, 7526, 223, 24805, 17632, 2380, 125, 5367, 17157, 30881, 223, 5438, 1556, 304, 125, 249, 10063, 2292, 6200, 2379, 11550, 216, 199, 196, 10094, 30925, 232, 24557, 14324, 103, 478, 7895, 195, 510, 18411, 406, 5367, 17157, 106, 212, 21300, 2272, 394, 30489, 105, 24889, 215, 513, 128, 387, 190, 3218, 10260, 9470, 16707, 4635, 566, 103]

The ‘token_type_ids’ are zeros for the first sentence and ones for the second sentence.

The ‘labels’ are just 0s or 1s.
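For reference, a minimal sketch of how one row in this format could be built. The helper `build_token_type_ids` is hypothetical, the ids in the toy example are made up, and `SEP_ID = 103` is an assumption read off the sample above (check `tokenizer.sep_token_id` for your own vocabulary).

```python
def build_token_type_ids(input_ids, sep_id):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    first_sep = input_ids.index(sep_id)
    n_first = first_sep + 1
    return [0] * n_first + [1] * (len(input_ids) - n_first)


SEP_ID = 103  # assumption: [SEP] id in this vocabulary

# Toy row: [CLS] a a a [SEP] b b [SEP]
example = {"input_ids": [102, 7, 8, 9, 103, 11, 12, 103], "labels": 0}
example["token_type_ids"] = build_token_type_ids(example["input_ids"], SEP_ID)
print(example["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```

In practice, calling the tokenizer on a sentence pair produces `token_type_ids` in this form already, so a helper like this is mainly useful for checking an existing dataset.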

Now I want to train an NSP task on my custom dataset (which will be larger in the end).

Can I use the Trainer API to easily train a BertForNextSentencePrediction?

Thank you in advance

Yes, you can train BertForNextSentencePrediction with Trainer.

Just define the model and use Trainer.

The important thing is to make the training arguments fit the arguments of the BertForNextSentencePrediction class.

Also, make sure the shape of the dataloader output fits the input data the model requires.
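As an illustration of the batch layout the model expects, here is a small padding collate function. The name `collate_nsp` is made up, and `pad_id=0` is an assumption (check `tokenizer.pad_token_id`). Every tensor comes out as `(batch, max_len)` except `labels`, which is `(batch,)`.

```python
import torch


def collate_nsp(features, pad_id=0):
    """Pad variable-length NSP rows to the longest row and build an attention mask."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padding positions are masked out, so their segment id does not matter.
        batch["token_type_ids"].append(f["token_type_ids"] + [0] * n_pad)
        batch["attention_mask"].append([1] * length + [0] * n_pad)
    batch = {k: torch.tensor(v, dtype=torch.long) for k, v in batch.items()}
    batch["labels"] = torch.tensor([f["labels"] for f in features], dtype=torch.long)
    return batch


# Two toy rows of different length (made-up ids).
toy_rows = [
    {"input_ids": [2, 10, 3, 11, 3], "token_type_ids": [0, 0, 0, 1, 1], "labels": 0},
    {"input_ids": [2, 10, 11, 3, 12, 13, 3],
     "token_type_ids": [0, 0, 0, 0, 1, 1, 1], "labels": 1},
]
batch = collate_nsp(toy_rows)
```

A function like this can be passed to Trainer through its `data_collator` argument.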

Thanks to nielsr and the HF team, here are some tutorials about fine-tuning BERT.

Their code uses Trainer for fine-tuning BERT, so you can use similar functions, methods, etc. in your own work.


By training arguments, do you mean the size of the tensors, e.g. of the logits?

Or do you mean that the dataset for fine-tuning should exactly match the form given here: BertForNextSentencePrediction? (Like input_ids, attention_mask, output_hidden_states, …)
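One way to check this yourself, as a small sketch: Trainer passes each batch to the model as keyword arguments, so the dataset's column names have to be parameter names of the model's `forward`. Flags such as `output_hidden_states` are optional and do not need to be dataset columns. The list `dataset_columns` below just mirrors the features listed earlier in the thread.

```python
import inspect

from transformers import BertForNextSentencePrediction

# Parameter names accepted by the model's forward method.
forward_params = list(
    inspect.signature(BertForNextSentencePrediction.forward).parameters
)
print(forward_params)

# Columns of the dataset described above.
dataset_columns = ["input_ids", "token_type_ids", "labels"]

# Any column listed here would not be understood by the model.
unknown = [c for c in dataset_columns if c not in forward_params]
print(unknown)
```

If `unknown` comes back empty, the column names line up with what the model accepts.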

I will definitely do that tutorial, thank you so much!