Multiple sentences in RoBERTa training

HyeyeonKoo · August 10, 2021, 11:32pm

Hello.

I have a question similar to https://discuss.huggingface.co/t/concatenate-sentances/4233, but that thread got no answer, so I am asking again here.

I wrote a script for training RoBERTa, and I build the dataset and collator as shown below.

...
        train_dataset = LineByLineTextDataset(
            tokenizer=tokenizer,
            file_path=os.path.join(dir, "train_data.txt"),
            block_size=tokenizer.max_len_single_sentence
        )

        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer, mlm=True, mlm_probability=0.15
        )

...

        trainer = CustomTrainer(
            model=model,
            args=train_config,
            data_collator=data_collator,
            train_dataset=train_dataset,
            custom_logger=logger
        )

I checked that every example in the training dataset is shorter than 512 tokens. As I understand it, RoBERTa training needs multiple sentences packed into one sequence, like [CLS] … [SEP] … [SEP] … [PAD].
Does the trainer handle this automatically? If not, is there a way I can do it myself?
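
For reference, here is a minimal sketch of how the packing could be done by hand before the data reaches the trainer. As far as I understand, LineByLineTextDataset turns each line of the file into its own example and the collator only pads and masks, so neither one concatenates sentences. The function name pack_sentences and the token ids in the example are hypothetical, chosen only for illustration. It works on plain lists of token ids and leaves padding to the collator.

```python
from typing import Iterable, List


def pack_sentences(
    sentences: Iterable[List[int]],
    max_len: int,
    bos_id: int,
    sep_id: int,
) -> List[List[int]]:
    """Greedily pack tokenized sentences into blocks of at most max_len ids.

    Each block looks like: bos, sentence 1, sep, sentence 2, sep, ...
    Sentences are expected WITHOUT special tokens. Padding is not added here.
    """
    blocks: List[List[int]] = []
    current: List[int] = [bos_id]
    for ids in sentences:
        # Truncate a sentence that could not fit even in an empty block
        # (room is needed for the leading bos and the trailing sep).
        ids = list(ids)[: max_len - 2]
        # Start a new block if this sentence plus its separator would overflow.
        if len(current) > 1 and len(current) + len(ids) + 1 > max_len:
            blocks.append(current)
            current = [bos_id]
        current.extend(ids)
        current.append(sep_id)
    # Keep the last, partially filled block if it holds any sentence.
    if len(current) > 1:
        blocks.append(current)
    return blocks


# Toy example with made-up ids: 0 stands in for the start token, 2 for the separator.
toy_sentences = [[10, 11], [12], [13, 14, 15]]
packed = pack_sentences(toy_sentences, max_len=6, bos_id=0, sep_id=2)
```

With a real tokenizer, the sentences could presumably come from tokenizer(line, add_special_tokens=False)["input_ids"], with tokenizer.bos_token_id and tokenizer.eos_token_id as the two special ids. Each packed block can then be wrapped as {"input_ids": block} in a custom dataset, and DataCollatorForLanguageModeling would pad and mask it as before.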
