Hi, I'm trying to use the datasets library to train a RoBERTa model from scratch and I'm not sure how to prepare the dataset to pass it to the Trainer:
!pip install datasets
from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
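For context, my understanding of what the collator does per batch (so I know what the dataset itself still has to provide): it picks about 15% of positions to predict, replaces 80% of those with the mask token, 10% with a random token, and leaves 10% unchanged, setting labels to -100 everywhere else. A rough plain-Python sketch of that logic (the function name and arguments are my own, not the library's):

```python
import random

def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, rng=None):
    """Sketch of MLM masking as I understand DataCollatorForLanguageModeling:
    select ~15% of positions; of those, 80% -> [MASK], 10% -> random token,
    10% kept as-is. Unselected positions get label -100 so the loss skips them."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = mask_token_id
            elif roll < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token unchanged
    return masked, labels
```

If that's right, the dataset only needs to supply plain input_ids and the collator handles all the masking at batch time.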
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,  # per_gpu_train_batch_size is deprecated
    save_steps=10_000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
How should I call dataset.set_format() so that the Trainer only receives the text column of the dataset, line by line?
Or what’s the proper way to prepare the dataset for MLM?
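What I think the preprocessing has to do (please correct me if I'm wrong): the train_dataset passed to Trainer should yield dicts with "input_ids" rather than raw text, so each line needs to be tokenized first and blank lines dropped. A minimal stand-in sketch, where toy_tokenize is just a placeholder for a real tokenizer call and both function names are my own:

```python
def toy_tokenize(text, max_length=8):
    # stand-in for a real subword tokenizer: one "id" per whitespace token,
    # with a fresh per-call vocab (only meant to illustrate the shapes)
    vocab = {}
    ids = [vocab.setdefault(w, len(vocab)) for w in text.split()]
    return ids[:max_length]

def tokenize_lines(batch):
    # drop blank lines, then tokenize each remaining line independently
    lines = [t for t in batch["text"] if t.strip()]
    return {"input_ids": [toy_tokenize(t) for t in lines]}
```

With the datasets library I assume the real version would be something like tokenized = dataset.map(tokenize_lines, batched=True, remove_columns=["text"]) (with the real tokenizer inside), and then tokenized["train"] goes to Trainer as train_dataset, with the collator doing the masking.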
In the past I have been doing it with:
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/dataset.txt",
)
but LineByLineTextDataset is deprecated and will be removed soon, and it does not support multiple .txt files.
Thanks