Streaming Wikipedia dataset

Calling load_dataset('wikipedia', '20220301.en', streaming=True) raises a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-af223b124995> in <cell line: 1>()
----> 1 load_dataset('wikipedia', lang, streaming=True)

1 frames
/usr/local/lib/python3.9/dist-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path)
   1247     ) -> Union[Dict[str, IterableDataset], IterableDataset]:
   1248         if not isinstance(self, (GeneratorBasedBuilder, ArrowBasedBuilder)):
-> 1249             raise ValueError(f"Builder {self.name} is not streamable.")
   1250 
   1251         is_local = not is_remote_filesystem(self._fs)

ValueError: Builder wikipedia is not streamable.

As a workaround, this works, although it feels a little meta (fourth-wall): load the dataset without streaming, then wrap it in a generator.


from datasets import load_dataset, IterableDataset

from torch.utils.data import DataLoader

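# Load the full (non-streaming) dataset from HF; without a split argument
# this returns a DatasetDict keyed by split name.
_ds = load_dataset('wikipedia', '20220301.en')

def _ds_gen():
    # Iterate over the rows of the 'train' split.
    # Note: len(_ds) would count the splits (1 here), not the rows.
    for i in range(len(_ds['train'])):
        yield _ds['train'][i]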

dataloader = DataLoader(
    IterableDataset.from_generator(_ds_gen)
)
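One detail worth checking in a wrapper like this is what len() is measuring: on the dict-like object returned by load_dataset without a split, it counts splits, not rows. Below is a minimal sketch of that pitfall, using a plain dict of lists as a stand-in for the real dataset so it runs without downloading anything. The names iter_split and fake_ds are hypothetical and only for illustration.

```python
from typing import Any, Dict, Iterator, List

# Stand-in for the DatasetDict that load_dataset returns without a split:
# one key per split, each holding a sequence of examples (assumption for illustration).
fake_ds: Dict[str, List[Dict[str, Any]]] = {
    "train": [
        {"id": "0", "text": "first article"},
        {"id": "1", "text": "second article"},
        {"id": "2", "text": "third article"},
    ]
}


def iter_split(dataset_dict: Any, split: str = "train") -> Iterator[Any]:
    """Yield examples one at a time from a single split (hypothetical helper)."""
    split_ds = dataset_dict[split]
    # Use the length of the split itself, not of the dict of splits.
    for i in range(len(split_ds)):
        yield split_ds[i]


num_splits = len(fake_ds)          # counts splits
num_rows = len(fake_ds["train"])   # counts examples in the split
examples = list(iter_split(fake_ds))
```

With the real dataset, a generator like this could be handed to IterableDataset.from_generator, for instance via its gen_kwargs argument, but note that the full dataset is still downloaded first, so this is not true streaming.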

Is there any way to build a generator wrapper like the one above into load_dataset('wikipedia', '20220301.en', streaming=True) itself?

Hi! We are working on making the wikipedia dataset streamable in this PR: "Support streaming Beam datasets from HF GCS preprocessed data" by albertvillanova (Pull Request #5689, huggingface/datasets on GitHub).

Thanks for the prompt reply! I guess for now, we have to stream the dataset with the “meta-snippet”.
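
In case it helps anyone else, here is a rough sketch of how the two paths could be combined until streaming is supported: try streaming=True first and fall back to the generator approach only when the "not streamable" ValueError from the traceback above is raised. The helper stream_or_fallback and the fake_load stand-in are hypothetical names for illustration; with the real library, load_fn would be datasets.load_dataset. The fallback still downloads the full dataset, so it only mimics the streaming interface.

```python
from typing import Any, Callable, Iterator


def stream_or_fallback(
    load_fn: Callable[..., Any], *args: Any, split: str = "train", **kwargs: Any
) -> Iterator[Any]:
    """Yield examples, preferring streaming and falling back to a full load."""
    try:
        examples = load_fn(*args, split=split, streaming=True, **kwargs)
    except ValueError as err:
        # Only swallow the specific error seen in the traceback; re-raise anything else.
        if "not streamable" not in str(err):
            raise
        examples = load_fn(*args, split=split, **kwargs)
    yield from examples


def fake_load(path: str, name: str, split: str = "train", streaming: bool = False):
    """Stand-in loader that mimics a builder without streaming support."""
    if streaming:
        raise ValueError(f"Builder {path} is not streamable.")
    return [{"id": 0}, {"id": 1}]


rows = list(stream_or_fallback(fake_load, "wikipedia", "20220301.en"))
```

Once the dataset becomes streamable, the first branch succeeds and the fallback is never used, so calling code would not need to change.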