Preparing an NLP dataset for MLM

Hi, I’m trying to use the datasets library to train a RoBERTa model from scratch, and I am not sure how to prepare the dataset to pass it to the Trainer:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset)

How should I call dataset.set_format() so that the Trainer only receives the text of the dataset, line by line?
Or what’s the proper way to prepare the dataset for MLM?

In the past I have been doing it with:

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/dataset.txt",
    block_size=128,  # block_size is a required argument; 128 is just a typical value
)

which is deprecated (it will be removed soon) and does not support multiple .txt files.
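From what I can tell, the datasets library can already read raw text files directly and accepts several of them at once, so that limitation should go away. A minimal sketch, with hypothetical file paths:

from datasets import load_dataset

# the generic "text" loader yields one example per line and takes multiple files
dataset = load_dataset(
    "text",
    data_files=["/dataset_1.txt", "/dataset_2.txt"],  # hypothetical paths
)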

Thanks

You should have a look at the preprocessing done in the run_mlm example. There is also the corresponding notebook that can help.
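To spell the idea out: run_mlm.py tokenizes the raw text with Dataset.map, then concatenates and chunks the token ids into fixed-size blocks; the DataCollatorForLanguageModeling then applies the 15% masking on the fly. A rough sketch, reusing the `dataset`, `tokenizer`, `model`, `data_collator` and `training_args` from the question (the `block_size` value here is just an example; run_mlm defaults to the model’s max length):

from transformers import Trainer

block_size = 128  # example value, an assumption here

def tokenize_function(examples):
    # wikicorpus keeps its contents in a "text" column
    return tokenizer(examples["text"], return_special_tokens_mask=True)

tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,  # drop the raw text columns
)

def group_texts(examples):
    # concatenate every field, then split the result into block_size chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

lm_dataset = tokenized.map(group_texts, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_dataset["train"],
)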

Hi there,
The first link does not work, and the notebook you’ve mentioned does not even run in Colab.

This is the working link: run_mlm.py
I’m not sure about the notebook though.

If you’re preparing an NLP dataset for a Masked Language Model (MLM), it’s important to have high-quality, diverse data to ensure the model can effectively understand and predict contextual language. For a comprehensive list of NLP datasets to help you get started, check out this blog: Top NLP Datasets to Supercharge Your Machine Learning Models. These datasets offer a variety of text sources that can support a range of NLP tasks, including MLM training.