Preparing a nlp dataset for MLM

tillfurger · February 15, 2021, 7:25pm

Hi, I’m trying to use nlp datasets to train a RoBERTa model from scratch, and I am not sure how to prepare the dataset so I can pass it to the Trainer:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, 
    mlm=True, 
    mlm_probability=0.15)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset)

How should I call dataset.set_format() so that it only takes the text of the dataset, line by line?
Or what’s the proper way to prepare the dataset for MLM?

In the past I have been doing it with:

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/dataset.txt"
)

However, LineByLineTextDataset will be removed soon and does not support multiple txt files.

Thanks

sgugger · February 16, 2021, 2:58am

You should have a look at the preprocessing done in the run_mlm example. There is also the corresponding notebook that can help.
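As a rough sketch of that kind of line-by-line preprocessing (not the script itself), something along these lines might work. It assumes the dataset has a "text" column, that `dataset` is a single split such as dataset["train"], and that a tokenizer is already loaded. The names `tokenize_lines` and `prepare_for_mlm` and the default `max_length=128` are made up for illustration.

```python
def tokenize_lines(batch, tokenizer, max_length=128):
    """Split each text into lines, drop blank ones, and tokenize the rest.

    `tokenize_lines` and `max_length=128` are illustrative choices, not
    part of any library.
    """
    lines = [
        line
        for text in batch["text"]
        for line in text.splitlines()
        if line.strip()
    ]
    # return_special_tokens_mask lets DataCollatorForLanguageModeling skip
    # special tokens when it picks positions to mask.
    return tokenizer(
        lines,
        truncation=True,
        max_length=max_length,
        return_special_tokens_mask=True,
    )


def prepare_for_mlm(dataset, tokenizer, max_length=128):
    """Tokenize one split (e.g. dataset["train"]) line by line.

    Assumes `dataset` is a datasets.Dataset with a "text" column. The
    original columns are removed because the number of rows changes.
    """
    return dataset.map(
        lambda batch: tokenize_lines(batch, tokenizer, max_length),
        batched=True,
        remove_columns=dataset.column_names,
    )
```

The result could then be passed to the Trainer, for example train_dataset=prepare_for_mlm(dataset["train"], tokenizer), with the data collator taking care of padding and masking per batch.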

Lucas · June 12, 2021, 7:25am

Hi there,
The first link does not work, and the notebook you mentioned does not work in Colab either.

rashmi · June 12, 2021, 9:53am

This is the link: run_mlm.py
I’m not sure about the notebook though.
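For reference, when it is not working line by line, that script (as far as I recall) concatenates the tokenized texts and cuts them into fixed-size blocks. Below is a minimal sketch of that step under my own assumptions: the input is a batch of already tokenized examples (lists of lists), and `block_size=128` is just an illustrative default.

```python
from itertools import chain


def group_texts(examples, block_size=128):
    """Concatenate tokenized examples and split them into equal blocks.

    `examples` maps column names (e.g. "input_ids") to lists of token
    lists. The tail that does not fill a whole block is dropped.
    """
    # Flatten every column into one long list of tokens.
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    first_key = next(iter(concatenated))
    # Round the total length down to a multiple of block_size.
    total_length = (len(concatenated[first_key]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```

With a datasets.Dataset this would typically be applied with tokenized.map(group_texts, batched=True), where `tokenized` is a hypothetical name for the already tokenized split.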

viharshah85 · November 8, 2024, 6:48am

If you’re preparing an NLP dataset for a Masked Language Model (MLM), it’s important to have high-quality, diverse data to ensure the model can effectively understand and predict contextual language. For a comprehensive list of NLP datasets to help you get started, check out this blog: Top NLP Datasets to Supercharge Your Machine Learning Models. These datasets offer a variety of text sources that can support a range of NLP tasks, including MLM training.
