My question is similar to https://discuss.huggingface.co/t/concatenate-sentances/4233, but that thread never got an answer, so I am asking again.
I wrote a script for training RoBERTa, and I build the dataset and the collator as shown below.
    ...
    train_dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path=os.path.join(dir, "train_data.txt"),
        block_size=tokenizer.max_len_single_sentence
    )
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=0.15
    )
    ...
    trainer = CustomTrainer(
        model=model,
        args=train_config,
        data_collator=data_collator,
        train_dataset=train_dataset,
        custom_logger=logger
    )
I checked that every example in the training dataset is shorter than 512 tokens. As I understand it, for RoBERTa training I need to pack multiple sentences into one sequence, like [CLS] … [SEP] … [SEP] … [PAD].
Does the Trainer do this packing automatically, or is there a method I can use to handle it myself?
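To make the question concrete, here is a rough sketch of the packing I have in mind. It is plain Python and not part of any library: `pack_sentences` is a hypothetical helper, and the special-token ids (`CLS_ID`, `SEP_ID`, `PAD_ID`) are placeholder values that would normally be read from the tokenizer. It assumes the sentences are already tokenized into id lists without special tokens.

```python
from typing import List

# Placeholder special-token ids (assumption for illustration);
# in a real script these would come from the tokenizer.
CLS_ID, SEP_ID, PAD_ID = 0, 2, 1


def pack_sentences(sentences: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily pack tokenized sentences into blocks of exactly max_len ids.

    Each block looks like: CLS s1 SEP s2 SEP ... PAD PAD
    """
    blocks: List[List[int]] = []
    current = [CLS_ID]
    for ids in sentences:
        # Truncate a sentence that would not fit even in an empty block
        # (one slot is reserved for CLS and one for SEP).
        ids = ids[: max_len - 2]
        # Close the current block if this sentence plus its SEP would overflow.
        if len(current) > 1 and len(current) + len(ids) + 1 > max_len:
            blocks.append(current)
            current = [CLS_ID]
        current = current + ids + [SEP_ID]
    # Keep the last block if it holds at least one sentence.
    if len(current) > 1:
        blocks.append(current)
    # Pad every block on the right up to max_len.
    return [block + [PAD_ID] * (max_len - len(block)) for block in blocks]


if __name__ == "__main__":
    toy = [[10, 11], [12], [13, 14, 15]]
    for block in pack_sentences(toy, max_len=8):
        print(block)
```

With `max_len=8`, the first two toy sentences end up in one block and the third starts a new one. What I cannot tell is whether something equivalent already happens inside the dataset, the collator, or the Trainer.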