Pre-Train BERT (from scratch)

Well, as you found, RoBERTa showed that leaving out NSP (next sentence prediction) yields better results on downstream tasks. ALBERT then brought back a related but different sentence-level task, namely sentence order prediction (SOP), which improved performance on downstream tasks.
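To make the difference concrete, here is a rough sketch of how the training pairs for the two objectives could be built. It assumes the corpus is a list of documents, each a list of sentences. The function names and the label convention (1 = true next sentence / correct order, 0 = otherwise) are my own for illustration, not taken from any library or from the papers' code. In NSP the negative is a sentence from a different document, so the model can often solve it from topic alone. In SOP the negative is the same two consecutive sentences swapped, so only the ordering distinguishes the classes.

```python
import random


def make_nsp_pair(docs, doc_idx, sent_idx, rng):
    """NSP-style pair (illustrative): positive is the true next sentence,
    negative is a random sentence from a *different* document."""
    first = docs[doc_idx][sent_idx]
    if rng.random() < 0.5:
        return first, docs[doc_idx][sent_idx + 1], 1
    other_idx = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return first, rng.choice(docs[other_idx]), 0


def make_sop_pair(docs, doc_idx, sent_idx, rng):
    """SOP-style pair (illustrative): both sentences are consecutive and
    from the same document; the negative is the same pair swapped."""
    first = docs[doc_idx][sent_idx]
    second = docs[doc_idx][sent_idx + 1]
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


if __name__ == "__main__":
    # Toy corpus, made up for this example.
    docs = [
        ["The cat sat down.", "Then it fell asleep.", "It woke at noon."],
        ["Stocks rose today.", "Analysts were surprised."],
    ]
    rng = random.Random(0)
    print(make_nsp_pair(docs, 0, 0, rng))
    print(make_sop_pair(docs, 0, 0, rng))
```

Real pipelines work on token segments rather than single sentences and also handle truncation and special tokens, so treat this purely as an illustration of the sampling logic.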

PS: please don’t post multiple consecutive posts; edit your existing post to add more information instead. The extra notifications are a bit annoying. :slight_smile:
