Can every line in the input CSV file contain more than one sentence when pretraining BERT for MLM loss?

Hello HF Team,

I am familiar with how to pretrain BERT, and I have a Dataloader that reads an input CSV file line by line; each time it reads a line, it tokenizes it and returns the tokens to the training code. My question is whether it is OK for this input CSV file to contain more than one sentence per line when pretraining BERT for masked language modelling.
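For context, here is a minimal sketch of the setup I'm describing. The CSV data, the `toy_tokenize` function, and its vocabulary handling are all placeholders (a real pipeline would use a subword tokenizer such as `BertTokenizerFast`); the point is only that each CSV row is treated as one training sample even when its text field holds several sentences.

```python
import csv
import io

# Placeholder for a real subword tokenizer (e.g. BertTokenizerFast).
# Here we just whitespace-split to keep the sketch self-contained.
def toy_tokenize(text, max_len=16):
    tokens = text.split()[: max_len - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]

# Hypothetical CSV: each row is ONE training sample, even though the
# first row's text field contains three '.'-separated sentences.
csv_data = io.StringIO(
    'text\n'
    '"The cat sat. The dog barked. It rained."\n'
    '"A single short sentence."\n'
)

samples = [row["text"] for row in csv.DictReader(csv_data)]
for line in samples:
    # Each line is tokenized as a whole and handed to the training loop.
    print(toy_tokenize(line))
```

The multi-sentence row is tokenized exactly like the single-sentence one; nothing in the loader splits on the `.` delimiter.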

Or is it important for each line to contain only one meaningful sentence? I am wondering whether self-attention will still work and the model will train properly even if every single line in the input CSV file (a single training sample) actually contains more than one sentence, each separated with a '.' delimiter, of course.