Hello, I am trying to pretrain various versions of BERT on a code corpus, using a BPE tokenizer. The issue is that newline characters are abundant in code, so they frequently get selected for masking, and the model ends up predicting newlines often, which is useless for code. Is there a way to prevent the data collator from masking certain tokens (in this case newlines/tabs/spaces)? Or is there another solution to this?
A fix at the collator level would save some expensive preprocessing of the dataset, since the corpus is huge relative to the hardware I have.
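For reference, this toy sketch shows the masking behaviour I am after, independent of the actual HF collator (all token IDs here are made up for illustration; the real IDs would come from the BPE tokenizer's vocab). Tokens on a blocklist are simply never candidates for masking:

```python
import random

# Hypothetical token IDs for illustration only; look the real ones up
# via tokenizer.convert_tokens_to_ids(...) for "\n", "\t", " ".
NEWLINE_ID, TAB_ID, SPACE_ID = 10, 11, 12
NEVER_MASK = {NEWLINE_ID, TAB_ID, SPACE_ID}
MASK_ID = 4        # hypothetical [MASK] id
MLM_PROB = 0.15    # standard BERT masking probability

def mask_tokens(input_ids, rng=None):
    """Return (inputs, labels) where blocklisted IDs are never masked.

    Labels use -100 for positions that should be ignored by the loss,
    matching the convention PyTorch's cross-entropy loss uses.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in input_ids:
        if tok not in NEVER_MASK and rng.random() < MLM_PROB:
            inputs.append(MASK_ID)  # masked position: predict original token
            labels.append(tok)
        else:
            inputs.append(tok)      # unmasked position: no prediction target
            labels.append(-100)
    return inputs, labels
```

In the HF setup I imagine this would mean subclassing the collator and zeroing out the masking probability at whitespace positions, rather than touching the dataset itself.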