Hello, I am trying to pretrain various versions of BERT on a code corpus, using a BPE tokenizer. The issue is that newline characters are abundant in code, so they frequently get selected for masking, and the model ends up predicting newlines often, which is useless for code. Is there a way to prevent the data collator from masking certain tokens (in this case newlines/tabs/spaces)? Or is there another solution to this?
A fix at the collator level would save some expensive preprocessing of the dataset, since the corpus is huge relative to the hardware I have.
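For reference, this toy sketch shows the masking behaviour I am after, independent of the actual HF collator (all token IDs here are made up for illustration; the real IDs would come from the BPE tokenizer's vocab). Tokens on a blocklist are simply never candidates for masking:

```python
import random

# Hypothetical token IDs for illustration only; look the real ones up
# via tokenizer.convert_tokens_to_ids(...) for "\n", "\t", " ".
NEWLINE_ID, TAB_ID, SPACE_ID = 10, 11, 12
NEVER_MASK = {NEWLINE_ID, TAB_ID, SPACE_ID}
MASK_ID = 4        # hypothetical [MASK] id
MLM_PROB = 0.15    # standard BERT masking probability

def mask_tokens(input_ids, rng=None):
    """Return (inputs, labels) where blocklisted IDs are never masked.

    Labels use -100 for positions that should be ignored by the loss,
    matching the convention PyTorch's cross-entropy loss uses.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in input_ids:
        if tok not in NEVER_MASK and rng.random() < MLM_PROB:
            inputs.append(MASK_ID)  # masked position: predict original token
            labels.append(tok)
        else:
            inputs.append(tok)      # unmasked position: no prediction target
            labels.append(-100)
    return inputs, labels
```

In the HF setup I imagine this would mean subclassing the collator and zeroing out the masking probability at whitespace positions, rather than touching the dataset itself.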