I wanted to ask whether it is common practice to clone (duplicate) the training dataset when pre-training a model with the MLM technique. Because masking is probabilistic, the model will likely mask different tokens each time, giving it a different task even when the same sentence is provided twice.
First iteration of dataset: I ate an apple —masking—> I [MASK] an apple
Second iteration of dataset: I ate an apple —masking—> I ate [MASK] apple
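To illustrate the idea from the two iterations above, here is a minimal sketch of dynamic masking (names like `random_mask` are my own; this simplifies the full BERT-style MLM recipe, which also keeps or randomly replaces a fraction of the selected tokens instead of always using [MASK]):

```python
import random

def random_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace each token with [MASK] independently with probability
    mask_prob. Re-sampling this every epoch means the same sentence
    produces different training examples without cloning the dataset."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

tokens = "I ate an apple".split()
# Two passes over the same sentence mask different positions:
print(random_mask(tokens, mask_prob=0.3, seed=1))
print(random_mask(tokens, mask_prob=0.3, seed=2))
```

If masking is applied on the fly like this (rather than once, ahead of time), cloning the dataset is unnecessary, since each epoch already sees freshly masked versions of every sentence.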
Thanks in advance for every comment.