Sequence Length in Continued Pretraining (MLM) & Masking Strategies

Hi,

I am experimenting with domain adaptation (along the lines of DAPT from Gururangan et al., "Don’t Stop Pretraining") and subsequent fine-tuning on downstream tasks, and I am seeking advice on in-domain sequence length and masking strategies. The pretrained models I am working with are either BERT or RoBERTa checkpoints.

Masking

To my understanding, DataCollatorForLanguageModeling replicates the masking strategy of RoBERTa and BERT: by default, 15% of the tokens in the input sequence are selected for prediction; of those, 80% are replaced with tokenizer.mask_token, 10% with a random token, and the remaining 10% are left unchanged. Tokens here are subwords.
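For reference, a minimal sketch of how I set the collator up; mlm_probability is the fraction of tokens selected for prediction, while the 80/10/10 replacement split seems to be fixed inside the collator, at least in the versions I have looked at:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Selects 15% of the (subword) tokens; of those, 80% become the mask token,
# 10% a random token, and the remaining 10% are left unchanged.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # the fraction I am considering tuning
)
```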

In your experience, does it make sense to play with the fraction of tokens that are replaced by [MASK] or random tokens?
Further, have you seen benefits from whole-word masking, and if so, did you try to ensure that the number of replaced tokens did not vary too much within a batch? (I sketch the collator I had in mind below.)
If anyone is aware of studies that look into the effects of masking strategies on MLM or downstream performance, it would be great if you could link them.
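For the whole-word variant, this is roughly what I had in mind; my understanding is that DataCollatorForWholeWordMask keys on the ## continuation prefix of BERT's WordPiece tokenizer, so I am not sure how well it transfers to RoBERTa's BPE vocabulary:

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Masks all subword pieces of a selected word together, instead of
# selecting subword tokens independently.
wwm_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer,
    mlm_probability=0.15,
)
```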

Sequence Length

From how I understand the RoBERTa paper, BERT pretraining (from scratch) uses a reduced sequence length for the first 90% of updates and also randomly injects short sequences during data generation. RoBERTa, on the other hand, trains only on full-length sequences (512 tokens in their case).

run_mlm.py contains two options:

  • truncating long sequences to max_seq_length
  • grouping document texts and splitting them into chunks of max_seq_length (I sketch both options below)

Intuitively, at least as far as domain adaptation is concerned, I feel that option 1 retains context much closer to what the model would actually see when fine-tuning on downstream tasks. However, sequence lengths vary significantly across my input documents (median length: 278 tokens, median absolute deviation: 125; I plan to drop very long and very short outliers).
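For clarity, here is roughly how I read the two options, paraphrased from run_mlm.py; the column name "text" and the use of datasets.Dataset.map(batched=True) are my assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
max_seq_length = 512

# Option 1: tokenize each document on its own and truncate to max_seq_length,
# so document boundaries (and their context) are preserved.
def tokenize_line_by_line(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        return_special_tokens_mask=True,
    )

# Option 2: concatenate all tokenized documents and cut the stream into
# fixed-size chunks; document boundaries are lost, but no tokens are wasted
# on padding. Meant to be applied with datasets.Dataset.map(..., batched=True)
# after a plain tokenization pass without truncation.
def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so every chunk has exactly max_seq_length tokens.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }
```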

In your projects, have you observed effects on MLM or downstream task performance from picking one strategy over the other?

Unfortunately, my training budget is limited and I cannot exhaustively test configurations, so I would be happy about any pointers to publications and/or personal anecdotes/experiences :slight_smile: