Hello.
I have a question similar to https://discuss.huggingface.co/t/concatenate-sentances/4233, but that thread got no answer, so I am asking here.
I wrote a script for training RoBERTa, and I create the dataset and collator as shown below.
...
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=os.path.join(dir, "train_data.txt"),
    block_size=tokenizer.max_len_single_sentence
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
...
trainer = CustomTrainer(
    model=model,
    args=train_config,
    data_collator=data_collator,
    train_dataset=train_dataset,
    custom_logger=logger
)
I checked that every example in the training dataset is shorter than 512 tokens. As I understand it, RoBERTa training needs multiple sentences packed into one sequence, like [CLS] … [SEP] … [SEP] … [PAD].
Does the trainer handle this automatically, or is there a way for me to do it myself?
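To make the goal concrete, here is a rough sketch of the packing I have in mind. It is plain Python over already-tokenized sentences and is not part of transformers: the function name pack_sentences and the special-token ids used in the example (0, 2, 1) are only illustrative assumptions, and in practice the ids would come from the tokenizer.

```python
from typing import List


def pack_sentences(
    sentences: List[List[int]],
    max_len: int,
    bos_id: int,
    sep_id: int,
    pad_id: int,
) -> List[List[int]]:
    """Greedily pack tokenized sentences into fixed-length sequences.

    Hypothetical helper: each output has the form
    bos, sentence, sep, sentence, sep, ..., pad, pad
    and is exactly max_len ids long.
    """
    packed: List[List[int]] = []
    current = [bos_id]
    for ids in sentences:
        # Truncate so a single sentence always fits next to bos and sep.
        ids = ids[: max_len - 2]
        # Flush the current sequence if this sentence plus its sep would overflow.
        if len(current) + len(ids) + 1 > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = [bos_id]
        current = current + ids + [sep_id]
    # Flush whatever is left, unless it is only the bos token.
    if len(current) > 1:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed


# Illustrative ids only: bos=0, sep=2, pad=1.
example = pack_sentences([[10, 11], [12], [13, 14, 15]], max_len=8, bos_id=0, sep_id=2, pad_id=1)
print(example)
```

If something like this is needed, I assume I would build the packed sequences before creating the dataset, and the collator would then only do the masking.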