Streaming Dataset Roberta

pere · December 2, 2021, 6:26pm

Does anyone know of a RoBERTa pretraining script with support for dataset streaming?

lhoestq · December 7, 2021, 12:07pm

Hi! I don’t think anyone in the community has shared a script for RoBERTa pretraining with dataset streaming yet. However, if you’re interested in looking into this, here are a few pointers:

RoBERTa was trained on BookCorpus, CC-News and OpenWebText (along with English Wikipedia and Stories)

BookCorpus and OpenWebText have been replicated and open-sourced as BookCorpusOpen and OpenWebText2 (part of The Pile)

You can load and interleave the datasets with:

from datasets import load_dataset, interleave_datasets

def only_keep_text(example):
    return {"text": example["text"]}

bc = load_dataset("bookcorpusopen", split="train", streaming=True)
ccn = load_dataset("cc_news", split="train", streaming=True)
# this one currently has streaming issues - will fix soon
# owt = load_dataset("the_pile_openwebtext2", split="train", streaming=True)  

dataset = interleave_datasets([
    bc.map(only_keep_text),
    ccn.map(only_keep_text),
    # owt.map(only_keep_text)
])
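
For intuition about what the interleaved stream yields: as far as I understand, when no sampling probabilities are given, interleave_datasets alternates between the sources one example at a time and stops once one of them runs out. The sketch below mimics that behaviour in plain Python on toy data. The round_robin helper and the sample records are hypothetical names made up for illustration; they are not part of the datasets API.

```python
from typing import Dict, Iterable, Iterator, List


def only_keep_text(example: Dict[str, str]) -> Dict[str, str]:
    # Drop every field except "text" so all sources share one schema.
    return {"text": example["text"]}


def round_robin(sources: List[Iterable[dict]]) -> Iterator[dict]:
    # Hypothetical stand-in for interleaving: take one example from each
    # source in turn, and stop as soon as any source is exhausted.
    iterators = [iter(source) for source in sources]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


# Toy records standing in for two streamed corpora with different columns.
books = [
    {"title": "b1", "text": "book one"},
    {"title": "b2", "text": "book two"},
]
news = [
    {"url": "n1", "text": "news one"},
    {"url": "n2", "text": "news two"},
    {"url": "n3", "text": "news three"},
]

mixed = list(round_robin([map(only_keep_text, books), map(only_keep_text, news)]))
for example in mixed:
    print(example)
```

Note that the third news record is never reached here, because the shorter source ends the stream first.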

Then you can check the documentation to see how to use it in a PyTorch training loop: see the Stream page of the datasets 1.16.1 documentation.
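
A streamed dataset is an iterable with no known length, so a training loop has to consume it lazily instead of indexing into it. Here is a minimal sketch of that pattern in plain Python. The batched helper and the toy_stream generator are hypothetical and only stand in for the interleaved dataset; in a real setup you would follow the documentation and hand the stream to a PyTorch DataLoader.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(stream: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    # Pull batch_size examples at a time without materializing the stream.
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_stream(n: int) -> Iterator[dict]:
    # Hypothetical stand-in for a streaming dataset of {"text": ...} examples.
    for i in range(n):
        yield {"text": f"document {i}"}


batches = list(batched(toy_stream(5), batch_size=2))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

The last batch can be smaller than batch_size, which a real training loop would need to either pad or drop.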
