BERT Next Sentence Prediction: How to do predictions?

Let’s say I have a BERT model pretrained as usual (with the MLM and NSP objectives) on a large custom dataset.

The way I understand NSP to work is that you take the embedding corresponding to the [CLS] token from the final layer and pass it to a linear layer that reduces it to 2 dimensions. You then apply a softmax on top to get a prediction of whether the pair of sentences is consecutive or not.
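To check that I have the mechanics right, here is a minimal sketch of that pipeline with the Hugging Face transformers library (the checkpoint name bert-base-uncased and the example sentences are just placeholders for my own model and data; as far as I know, the original BERT actually feeds the [CLS] vector through a pooler, a dense layer with tanh, before the classifier):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Placeholder checkpoint; in practice this would be my own pretrained model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# The "NSP head" as I understand it: a single linear layer mapping the
# [CLS] representation to 2 logits (consecutive vs. not consecutive).
nsp_head = nn.Linear(encoder.config.hidden_size, 2)

sent_a = "The man went to the store."
sent_b = "He bought a gallon of milk."
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")  # builds [CLS] A [SEP] B [SEP]

with torch.no_grad():
    outputs = encoder(**inputs)

cls_repr = outputs.pooler_output        # pooled [CLS] vector, shape (1, hidden_size)
logits = nsp_head(cls_repr)             # shape (1, 2)
probs = torch.softmax(logits, dim=-1)   # meaningless until the head is trained
print(probs)
```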

Now, the published pretrained model I have does not include this “NSP head”, so I have to train one myself. How do I do this? Since I presume the only parameters I’ll need to train are those of the linear layer, will a relatively small dataset of sentence pairs be enough?
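In case it helps clarify the question, this is roughly what I imagine the training step would look like, continuing the sketch above (encoder, nsp_head), with pair_dataloader as a hypothetical DataLoader that yields tokenized sentence pairs plus 0/1 labels:

```python
import torch
import torch.nn as nn

# Freeze the pretrained encoder so only the NSP head gets updated.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

optimizer = torch.optim.AdamW(nsp_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for batch in pair_dataloader:  # hypothetical: {"inputs": tokenizer output, "labels": 0/1 tensor}
    with torch.no_grad():                      # no gradients through the frozen encoder
        pooled = encoder(**batch["inputs"]).pooler_output
    logits = nsp_head(pooled)                  # shape (batch_size, 2)
    loss = loss_fn(logits, batch["labels"])    # 0 = consecutive, 1 = not consecutive
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

My assumption is that, with the encoder frozen, the head has only on the order of a couple of thousand parameters to learn (hidden_size × 2 plus biases for a base-sized model), which is why I’m hoping a small dataset would suffice.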

Why is the convention to throw away this NSP head? Isn’t it a useful thing to publish for others to use?