Hello,
I would really love to load just a sample of a dataset rather than the whole thing at first. Can I do this with the Hugging Face library? I don't want to download the full dataset, since it is 23 GB, but rather download a sample and start working on it right away before moving to the whole dataset.
Any ideas?
As far as I know, this is something that is actively being worked on.
huggingface:master ← huggingface:dataset-streaming (opened 18 May 2021, 06:20 PM UTC)
# Dataset Streaming
## API
Current API is
```python
from datasets import load_dataset
# Load an IterableDataset without downloading data
snli = load_dataset("snli", streaming=True)
# Access examples by streaming data
print(next(iter(snli["train"])))
# {'premise': 'A person on a horse jumps over a broken down airplane.',
# 'hypothesis': 'A person is training his horse for a competition.',
# 'label': 1}
```
I already implemented a few methods:
- IterableDataset.map: apply transforms on-the-fly to the examples
- IterableDataset.shuffle: shuffle the data _a la_ TFDS, i.e. with a shuffling buffer
- IterableDataset.with_format: set the format to `"torch"` to get a `torch.utils.data.IterableDataset`
- merge_datasets: merge two iterable datasets by alternating between them (you can specify the sampling probabilities)
I would love to have your opinion on the API design :)
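To make the design concrete, here is a short sketch of how these calls could chain together, based only on the methods listed above (the `buffer_size` value is illustrative):
```python
from datasets import load_dataset

# Load SNLI as an IterableDataset without downloading the data
snli = load_dataset("snli", streaming=True)
train = snli["train"]

# map applies the transform lazily, example by example, while streaming
train = train.map(lambda ex: {"premise": ex["premise"].lower()})

# shuffle TFDS-style, using a fixed-size shuffling buffer
train = train.shuffle(buffer_size=10_000)

# with_format("torch") yields a torch.utils.data.IterableDataset
train = train.with_format("torch")

print(next(iter(train)))
```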
## Implementation details
### Streaming
Data streaming is done using `fsspec` which has nice caching features.
To make dataset streaming work, I extend the `open` function used in dataset scripts to support opening remote files without downloading them entirely. It also works with remote compressed archives (currently only zip is supported):
```python
# Get a file-like object by streaming data from a remote file
open("https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt")
# Get a file-like object by streaming data from a remote compressed archive by using the hop separator "::"
open("zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip")
```
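For reference, the same two reads can be expressed with plain `fsspec`, which is what the extended `open` delegates to (a sketch; the in-library plumbing differs):
```python
import fsspec

# fsspec opens remote files lazily: bytes are fetched over HTTP as you read
with fsspec.open("https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt", "rt") as f:
    print(f.readline())

# the "::" hop separator chains filesystems: read one member of a remote zip
with fsspec.open("zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip", "rt") as f:
    print(f.readline())
```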
I also extend the `os.path.join` function to support navigating inside remote compressed archives, since it has to deal with the `"::"` separator used by `fsspec`.
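A minimal sketch of what such a `"::"`-aware join could look like (the name `xjoin` and the exact behavior are my assumptions, not necessarily the PR's code):
```python
import posixpath

def xjoin(base: str, *parts: str) -> str:
    # Join inside the archive part of a chained URL, leaving the remote
    # archive URL after "::" untouched; fall back to a plain join otherwise.
    if "::" in base:
        inner, outer = base.split("::", 1)
        return "::".join([posixpath.join(inner, *parts), outer])
    return posixpath.join(base, *parts)

print(xjoin("zip://::https://nlp.stanford.edu/projects/snli/snli_1.0.zip", "snli_1.0_train.txt"))
# zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip
```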
Finally, I also added a retry mechanism in case the connection fails during data streaming.
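Such a retry could look roughly like this (an illustrative sketch with exponential backoff; names and parameters are assumptions):
```python
import time

def read_with_retry(fileobj, size=-1, max_retries=3, base_delay=1.0):
    # Retry a read on connection failures, backing off between attempts
    for attempt in range(max_retries + 1):
        try:
            return fileobj.read(size)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```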
### Transforms
An IterableDataset wraps an ExamplesIterable instance. There are different subclasses depending on the transforms we want to apply (see the sketch after this list):
- ExamplesIterable: the basic one
- MappedExamplesIterable: an iterable with a `map` function applied on the fly
- BufferShuffledExamplesIterable: an iterable with a shuffling buffer
- CyclingMultiSourcesExamplesIterable: alternates between several ExamplesIterable
- RandomlyCyclingMultiSourcesExamplesIterable: randomly alternates between several ExamplesIterable
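Here is a minimal sketch of that wrapper pattern for the first two classes (the class names come from the PR, but the constructor signatures are assumptions):
```python
from typing import Any, Callable, Dict, Iterator, Tuple

class ExamplesIterable:
    def __init__(self, generate_examples_fn: Callable[[], Iterator[Tuple[Any, Dict]]]):
        self.generate_examples_fn = generate_examples_fn

    def __iter__(self):
        # yield (key, example) pairs straight from the underlying generator
        yield from self.generate_examples_fn()

class MappedExamplesIterable(ExamplesIterable):
    def __init__(self, ex_iterable: ExamplesIterable, function: Callable[[Dict], Dict]):
        self.ex_iterable = ex_iterable
        self.function = function

    def __iter__(self):
        # apply the map function on the fly, one example at a time
        for key, example in self.ex_iterable:
            yield key, self.function(example)
```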
### DatasetBuilder
I use the same builders as usual. I just added a new method `_get_examples_iterable_for_split` to get an ExamplesIterable for a given split. Currently only the GeneratorBasedBuilder and the ArrowBasedBuilder implement it.
The BeamBasedBuilder doesn't implement it yet, which means that datasets like wikipedia and natural_questions can't be loaded as an IterableDataset for now.
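For illustration, a GeneratorBasedBuilder could implement the new method by lazily wrapping its existing `_generate_examples` generator (a hypothetical sketch reusing the `ExamplesIterable` class sketched above):
```python
class MyTextDatasetBuilder:  # stand-in for a GeneratorBasedBuilder subclass
    def _generate_examples(self, filepath):
        # the usual example generator that builders already define
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}

    def _get_examples_iterable_for_split(self, split_generator):
        # wrap the generator lazily instead of materializing an Arrow table
        return ExamplesIterable(
            lambda: self._generate_examples(**split_generator.gen_kwargs)
        )
```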
## Other details
~~I may have to make changes in many dataset scripts to use `download` instead of `download_and_extract` when extraction is not needed. This will avoid errors during streaming.~~
EDIT: Actually I just check the extension of the file and do extraction only if needed.
EDIT2: It's not possible to stream from .tar.gz files without downloading the file completely. For now I raise an error if one wants to get a streaming dataset based on .tar.gz files.
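The extension check could be as simple as this sketch (the function name and details are assumptions):
```python
def prepare_for_streaming(url: str) -> str:
    if url.endswith((".tar.gz", ".tgz")):
        # tar.gz members can't be read without downloading the whole archive
        raise NotImplementedError(f"streaming from {url} is not supported")
    if url.endswith(".zip"):
        # only archives need the chained form; the caller adds the member path
        return f"zip://::{url}"
    return url  # plain files are streamed directly, no extraction needed
```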
## TODO
usual stuff:
- [x] make streaming dependency "aiohttp" optional: `pip install datasets[streaming]`
- [x] tests
- [x] docs
@lhoestq might be able to provide more info
Thanks for the info. Such a time saver.
This changes everything. Can't wait @lhoestq