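Well, as you found, RoBERTa showed that leaving out NSP matches or slightly improves results on downstream tasks. ALBERT then re-added a related but notably different task, sentence order prediction (SOP): instead of telling a real next sentence from a random one, the model has to tell whether two consecutive segments are in the right order or swapped. That improved performance on downstream tasks.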
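To make the difference between the two objectives concrete, here is a rough sketch of how the training pairs are built for next sentence prediction (NSP) versus sentence order prediction (SOP). The function names `nsp_example` and `sop_example` are made up for illustration and are not taken from either codebase; the real pipelines work on tokenized multi-sentence segments rather than single sentences, so treat this as a simplification.

```python
import random

def nsp_example(doc, other_doc, i, rng):
    """BERT-style NSP (simplified): the negative is a sentence from another document."""
    first = doc[i]
    if rng.random() < 0.5:
        return first, doc[i + 1], 1          # label 1: the true next sentence
    return first, rng.choice(other_doc), 0   # label 0: random sentence from elsewhere

def sop_example(doc, i, rng):
    """ALBERT-style SOP (simplified): the negative is the same pair, swapped."""
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, 1              # label 1: original order
    return second, first, 0                  # label 0: swapped order

rng = random.Random(0)
doc = ["The cat sat down.", "Then it fell asleep.", "It woke up an hour later."]
other_doc = ["Stock prices rose on Monday.", "Analysts were surprised."]

print(nsp_example(doc, other_doc, 0, rng))
print(sop_example(doc, 0, rng))
```

Note that an NSP negative can often be spotted from the topic alone, since the second sentence comes from a different document. An SOP negative has exactly the same content as the positive, so the only signal left is the ordering.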
PS: please don’t post several times in a row; edit your existing post to add more information instead. Every new post sends a notification, which gets a bit annoying.