Sequence Length in Continued Pretraining (MLM) & Masking Strategies


I am experimenting with domain adaptation (along the lines of DAPT from Gururangan et al., "Don’t Stop Pretraining") followed by fine-tuning on downstream tasks, and I am seeking advice on in-domain sequence length and masking strategies. The pretrained models I am working with are either BERT or RoBERTa checkpoints.


To my understanding, DataCollatorForLanguageModeling replicates the masking strategy of BERT and RoBERTa: by default, 15% of the tokens in the input sequence are selected for prediction, and of those, 80% are replaced with tokenizer.mask_token, 10% are replaced with a random token, and the remaining 10% are left unchanged. Tokens here are subwords.
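To make the knobs in question concrete, here is a minimal, library-free sketch of that selection rule. It is not the DataCollatorForLanguageModeling implementation; the function name mask_tokens and the parameters mask_frac and random_frac are hypothetical names chosen for illustration. The label value -100 follows the common convention for positions that are ignored by the loss.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, mask_frac=0.8, random_frac=0.1, rng=None):
    """Return (inputs, labels) for one sequence of token ids.

    Hypothetical helper: each non-special token is selected with probability
    mlm_probability; a selected token becomes mask_id with probability
    mask_frac, a random id with probability random_frac, and stays unchanged
    otherwise. Labels are -100 wherever no prediction is required.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue  # position is not selected for prediction
        labels[i] = tok  # selected positions contribute to the MLM loss
        r = rng.random()
        if r < mask_frac:
            inputs[i] = mask_id
        elif r < mask_frac + random_frac:
            # Simplification: the random id may collide with the original
            # token or with a special token.
            inputs[i] = rng.randrange(vocab_size)
        # otherwise the original token is kept as input
    return inputs, labels
```

Changing mlm_probability alters how many positions are predicted, while mask_frac and random_frac only alter how the selected positions are corrupted, so the two can be varied independently.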

In your experience, does it make sense to play with the fraction of tokens that are replaced by [MASK] or random tokens?
Further, have you seen benefits from using whole-word masking and, if so, did you attempt to ensure that the number of tokens replaced under whole-word masking did not vary too much within a batch?
If anyone is aware of studies that looked into the effects of masking strategies on MLM/downstream model performance, I would be grateful for links.
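One way to keep the number of replaced tokens roughly constant under whole-word masking is to fix a per-sequence token budget and skip any word that would exceed it. The sketch below illustrates that idea under two assumptions: a WordPiece-style tokenizer in which continuation pieces start with "##" (RoBERTa's byte-level BPE marks word boundaries differently), and a hypothetical helper name whole_word_mask_positions. It only selects positions; how those positions are then corrupted is a separate step.

```python
import random

def whole_word_mask_positions(tokens, mlm_probability=0.15,
                              special_tokens=("[CLS]", "[SEP]", "[PAD]"),
                              rng=None):
    """Pick positions to mask so that chosen words are covered completely.

    Hypothetical helper assuming WordPiece tokens, where a piece starting
    with "##" continues the previous word.
    """
    rng = rng or random.Random()
    words = []  # each entry is the list of token positions forming one word
    for i, tok in enumerate(tokens):
        if tok in special_tokens:
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    # Fixed budget per sequence: at most this many subword tokens are masked.
    budget = max(1, round(len(tokens) * mlm_probability))
    rng.shuffle(words)
    chosen = []
    for word in words:
        if len(chosen) + len(word) > budget:
            continue  # this word would exceed the budget, try the next one
        chosen.extend(word)
        if len(chosen) == budget:
            break
    return sorted(chosen)
```

With a budget like this, the masked count per sequence can fall short of the target when only long words remain, but it never exceeds it.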

Sequence Length

As I understand the RoBERTa paper, BERT training (from scratch) starts with shorter sequences and also randomly injects short sequences in later training iterations. RoBERTa, on the other hand, trains only on full-length sequences (512 tokens in their case). The training script I am looking at contains two options:

  • truncating long sequences to max_length
  • grouping document texts and splitting them into chunks of max_seq_length
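The difference between the two options can be sketched on plain lists of token ids. This is an illustration, not the code of any particular training script; the function names truncate_each and group_and_chunk are hypothetical, and dropping the incomplete final chunk is an assumption of this sketch.

```python
def truncate_each(docs, max_length):
    """Option 1: one example per document, cut off at max_length tokens."""
    return [doc[:max_length] for doc in docs]

def group_and_chunk(docs, max_seq_length):
    """Option 2: concatenate all documents, then cut fixed-size chunks.

    Assumption: the trailing tokens that do not fill a whole chunk are dropped.
    """
    stream = [tok for doc in docs for tok in doc]
    usable = (len(stream) // max_seq_length) * max_seq_length
    return [stream[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]

docs = [[1, 2, 3, 4, 5, 6], [7, 8], [9, 10, 11]]
print(truncate_each(docs, 4))    # [[1, 2, 3, 4], [7, 8], [9, 10, 11]]
print(group_and_chunk(docs, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Truncation keeps document boundaries intact but discards the tail of long documents and leaves short examples to be padded. Grouping uses almost every token and yields full-length sequences, but a chunk can start mid-document or span two unrelated documents.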

Intuitively - at least as far as domain adaptation is concerned - I feel that the first option (truncation) might retain context that is much closer to what the model would actually see when fine-tuning on downstream tasks. However, sequence lengths vary significantly between input documents in my case (median length: 278, median absolute deviation: 125; very long and very short outliers will be removed).
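For reference, a median/MAD-based length filter could look like the sketch below. The function name filter_by_length and the cutoff k=3 are assumptions for illustration; the post does not say how the outliers are defined.

```python
from statistics import median

def filter_by_length(lengths, k=3.0):
    """Keep lengths within k median absolute deviations of the median.

    Hypothetical helper; k is an arbitrary cutoff chosen for illustration.
    """
    med = median(lengths)
    mad = median([abs(n - med) for n in lengths])
    low, high = med - k * mad, med + k * mad
    return [n for n in lengths if low <= n <= high]

print(filter_by_length([5, 250, 270, 278, 290, 300, 4000]))
# [250, 270, 278, 290, 300]
```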

In your projects, have you observed effects on MLM/downstream task performance from choosing one strategy over the other?

Unfortunately, my training budget is limited, so I cannot exhaustively test configurations. Hence, I would be grateful for any pointers to publications and/or personal anecdotes/experiences :)