Streaming Dataset Roberta

Does anyone know of a RoBERTa pretraining script that supports dataset streaming?

Hi! I don’t think anyone in the community has shared a RoBERTa pretraining script that uses dataset streaming yet. However, if you’re interested in looking into this, here are a few pointers:

RoBERTa was pretrained on BookCorpus, CC-News and OpenWebText (the paper also lists English Wikipedia and Stories).

BookCorpus and OpenWebText have been replicated and open-sourced as BookCorpusOpen and OpenWebText2 (part of The Pile).

You can load the datasets in streaming mode and interleave them with:

from datasets import load_dataset, interleave_datasets

def only_keep_text(example):
    return {"text": example["text"]}

bc = load_dataset("bookcorpusopen", split="train", streaming=True)
ccn = load_dataset("cc_news", split="train", streaming=True)
# this one currently has streaming issues - will fix soon
# owt = load_dataset("the_pile_openwebtext2", split="train", streaming=True)  

dataset = interleave_datasets([
    bc.map(only_keep_text),
    ccn.map(only_keep_text),
    # owt.map(only_keep_text)
])
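
To get a feel for the order in which interleaved examples come out, here is a small pure-Python sketch of round-robin interleaving. It is not the library's implementation: `round_robin` is a hypothetical helper and the toy lists stand in for the streamed datasets. It assumes that, without sampling probabilities, the sources are alternated and iteration ends once a source runs out. The exact stopping point may differ between library versions.

```python
def round_robin(*sources):
    """Yield one example from each source in turn; stop when any source runs out."""
    iterators = [iter(source) for source in sources]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                # One source is exhausted, so the mixed stream ends here.
                return


# Toy stand-ins for the streamed datasets (hypothetical data).
books = [{"text": "book 0"}, {"text": "book 1"}, {"text": "book 2"}]
news = [{"text": "news 0"}, {"text": "news 1"}]

mixed = [example["text"] for example in round_robin(books, news)]
print(mixed)
```

Note that the shorter source bounds the length of the mixed stream, so a much smaller dataset will cut the larger ones short unless you handle that explicitly.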

Then you can check the "Stream" page of the datasets 1.16.1 documentation to see how to use the resulting dataset in a PyTorch training loop.
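
Since a streamed dataset is just an iterable with no known length, the training loop has to pull batches from it lazily. Below is a minimal pure-Python sketch of that idea, not the approach from the documentation: `batched` and `toy_stream` are hypothetical names, and the toy generator stands in for the interleaved dataset.

```python
from itertools import islice


def batched(stream, batch_size):
    """Group an iterable of examples into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_stream():
    """Hypothetical stand-in for the interleaved streaming dataset."""
    for i in range(5):
        yield {"text": f"document {i}"}


for step, batch in enumerate(batched(toy_stream(), batch_size=2)):
    texts = [example["text"] for example in batch]
    # A real loop would tokenize `texts` and run the model's training step here.
    print(step, texts)
```

Only one batch is held in memory at a time, which is the point of streaming. The last batch may be smaller than `batch_size`, so the training step should either accept that or drop it.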