Is there a way to prevent certain tokens from being masked during pretraining?

Hello, I am trying to pretrain various versions of BERT on a code corpus, using a BPE tokenizer. The issue is that since newline characters are abundant in code, they frequently end up being masked for prediction. This leads to the model predicting newlines often, which is useless for code. Is there some way to prevent the data collator from masking certain tokens (in this case newlines/tabs/spaces)? Or is there another solution to this?

Since the corpus is huge relative to the hardware I have, handling this at collation time would save some expensive preprocessing of the dataset.

There is no mechanism implemented for this, so you should copy the code of the data collator you are using and adapt it a little bit so that it never masks those tokens.
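As a rough sketch of that adaptation, here is a simplified, standalone version of the masking step used by `DataCollatorForLanguageModeling`, extended with a `never_mask_ids` argument (a hypothetical parameter, not part of transformers) that zeroes out the masking probability for the whitespace token ids. In practice you would put the same one-line change into a subclass of the collator (in recent transformers versions the method to override is `torch_mask_tokens`; older versions call it `mask_tokens`). The 10%/10% random-token/keep split of the real collator is omitted here for brevity:

```python
import torch

def mask_tokens(inputs, mlm_probability=0.15, mask_token_id=103,
                never_mask_ids=(), special_tokens_mask=None):
    """Simplified MLM masking step with an exclusion list.

    `never_mask_ids` would hold the ids your BPE tokenizer assigns to
    newline/tab/space tokens; those positions are never selected for masking.
    """
    labels = inputs.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    # Special tokens ([CLS], [SEP], padding, ...) are already excluded
    # by the stock collator via this mask.
    if special_tokens_mask is not None:
        probability_matrix.masked_fill_(special_tokens_mask.bool(), value=0.0)
    # The adaptation: also zero out masking probability for excluded ids.
    if never_mask_ids:
        excluded = torch.isin(inputs, torch.tensor(sorted(never_mask_ids)))
        probability_matrix.masked_fill_(excluded, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked positions
    # Simplified: always replace selected tokens with [MASK]
    # (the real collator uses the 80/10/10 mask/random/keep split).
    inputs = inputs.clone()
    inputs[masked_indices] = mask_token_id
    return inputs, labels
```

With `mlm_probability=1.0` and `never_mask_ids={5}`, every position is masked except those holding token id 5, which is the behavior you want for whitespace tokens.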