Let’s say I have a BERT model that was pretrained on a large custom dataset, using the usual MLM and NSP objectives.
The way I understand NSP to work is: you take the embedding corresponding to the [CLS] token from the final layer and pass it through a linear layer that reduces it to 2 dimensions. Then you apply a softmax on top to get a prediction of whether the two sentences are consecutive or not.
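To make sure I’m describing the same architecture, here is a rough sketch of what I have in mind (PyTorch; the name `NSPHead` and the hidden size of 768 are just my placeholders for the bert-base setup):

```python
import torch
import torch.nn as nn

class NSPHead(nn.Module):
    """Tiny NSP head: final-layer [CLS] embedding -> 2-way prediction."""
    def __init__(self, hidden_size: int = 768):  # 768 = bert-base hidden size
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (batch, hidden_size), the [CLS] vector from the last layer
        logits = self.classifier(cls_embedding)   # (batch, 2)
        return torch.softmax(logits, dim=-1)      # P(is next), P(not next)
```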
Now, the published pretrained model I have does not include this “NSP head”, so I have to train one myself. How do I do this? Since I presume the only parameters I’ll need to train are those of the linear layer, will a small dataset be enough?
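Concretely, is something like the following the right idea: freeze the pretrained encoder and update only the new linear layer? (In this sketch I use the stock `bert-base-uncased` checkpoint and a toy sentence pair purely as stand-ins for my own model and dataset, and I follow the convention that label 0 means “sentence B follows sentence A”.)

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # stand-in for my checkpoint
encoder = BertModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad = False  # only the new head gets trained

head = nn.Linear(encoder.config.hidden_size, 2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a toy "consecutive sentences" example (label 0 = is-next).
enc = tokenizer("The cat sat down.", "Then it fell asleep.", return_tensors="pt")
label = torch.tensor([0])

with torch.no_grad():  # encoder is frozen, no need for its gradients
    outputs = encoder(**enc)
cls_vec = outputs.last_hidden_state[:, 0]  # final-layer [CLS] embedding

logits = head(cls_vec)
loss = loss_fn(logits, label)
loss.backward()
optimizer.step()
```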
Why is the convention to throw away this NSP head? Isn’t it a useful thing to publish for others to use?