Hello,
I would really love to load just a sample of a dataset rather than the whole thing at first. Can I do this with the Hugging Face library? I don't want to download the full dataset, since it is 23 GB, but rather download a sample and start working on it right away before moving to the whole dataset.
Any ideas?
As far as I know, this is something that is actively being worked on.
huggingface:master ← huggingface:dataset-streaming (opened 18 May 2021, 06:20 PM UTC)
# Dataset Streaming
## API
Current API is
```python
from datasets import load_dataset
# Load an IterableDataset without downloading data
snli = load_dataset("snli", streaming=True)
# Access examples by streaming data
print(next(iter(snli["train"])))
# {'premise': 'A person on a horse jumps over a broken down airplane.',
# 'hypothesis': 'A person is training his horse for a competition.',
# 'label': 1}
```
I already implemented a few methods:
- IterableDataset.map: apply transforms on-the-fly to the examples
- IterableDataset.shuffle: shuffle the data _a la_ TFDS, i.e. with a shuffling buffer
- IterableDataset.with_format: set the format to `"torch"` to get a `torch.utils.data.IterableDataset`
- merge_datasets: merge two iterable datasets by alternating between them (you can specify the sampling probabilities)
I would love to have your opinion on the API design :)
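To make the design concrete, here is a short sketch of how these calls could chain together, based only on the methods listed above (the `buffer_size` value is illustrative):
```python
from datasets import load_dataset

# Load SNLI as an IterableDataset without downloading the data
snli = load_dataset("snli", streaming=True)
train = snli["train"]

# map applies the transform lazily, example by example, while streaming
train = train.map(lambda ex: {"premise": ex["premise"].lower()})

# shuffle TFDS-style, using a fixed-size shuffling buffer
train = train.shuffle(buffer_size=10_000)

# with_format("torch") yields a torch.utils.data.IterableDataset
train = train.with_format("torch")

print(next(iter(train)))
```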
## Implementation details
### Streaming
Data streaming is done using `fsspec` which has nice caching features.
To make dataset streaming work, I extend the `open` function used in dataset scripts to support opening remote files without downloading them entirely. It also works with remote compressed archives (currently only zip is supported):
```python
# Get a file-like object by streaming data from a remote file
open("https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt")
# Get a file-like object by streaming data from a remote compressed archive by using the hop separator "::"
open("zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip")
```
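For reference, the same two reads can be expressed with plain `fsspec`, which is what the extended `open` delegates to (a sketch; the in-library plumbing differs):
```python
import fsspec

# fsspec opens remote files lazily: bytes are fetched over HTTP as you read
with fsspec.open("https://github.com/davidsbatista/NER-datasets/raw/master/CONLL2003/train.txt", "rt") as f:
    print(f.readline())

# the "::" hop separator chains filesystems: read one member of a remote zip
with fsspec.open("zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip", "rt") as f:
    print(f.readline())
```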
I also extend the `os.path.join` function to support navigating inside remote compressed archives, since it has to deal with the `"::"` separator used by `fsspec`.
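A minimal sketch of what such a `"::"`-aware join could look like (the name `xjoin` and the exact behavior are my assumptions, not necessarily the PR's code):
```python
import posixpath

def xjoin(base: str, *parts: str) -> str:
    # Join inside the archive part of a chained URL, leaving the remote
    # archive URL after "::" untouched; fall back to a plain join otherwise.
    if "::" in base:
        inner, outer = base.split("::", 1)
        return "::".join([posixpath.join(inner, *parts), outer])
    return posixpath.join(base, *parts)

print(xjoin("zip://::https://nlp.stanford.edu/projects/snli/snli_1.0.zip", "snli_1.0_train.txt"))
# zip://snli_1.0_train.txt::https://nlp.stanford.edu/projects/snli/snli_1.0.zip
```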
Finally, I also added a retry mechanism in case the connection fails during data streaming.
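Such a retry could look roughly like this (an illustrative sketch with exponential backoff; names and parameters are assumptions):
```python
import time

def read_with_retry(fileobj, size=-1, max_retries=3, base_delay=1.0):
    # Retry a read on connection failures, backing off between attempts
    for attempt in range(max_retries + 1):
        try:
            return fileobj.read(size)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```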
### Transforms
An IterableDataset wraps an ExamplesIterable instance. There are different subclasses depending on the transforms we want to apply (see the sketch after this list):
- ExamplesIterable: the basic one
- MappedExamplesIterable: an iterable with a `map` function applied on the fly
- BufferShuffledExamplesIterable: an iterable with a shuffling buffer
- CyclingMultiSourcesExamplesIterable: alternates between several ExamplesIterable
- RandomlyCyclingMultiSourcesExamplesIterable: randomly alternates between several ExamplesIterable
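Here is a minimal sketch of that wrapper pattern for the first two classes (the class names come from the PR, but the constructor signatures are assumptions):
```python
from typing import Any, Callable, Dict, Iterator, Tuple

class ExamplesIterable:
    def __init__(self, generate_examples_fn: Callable[[], Iterator[Tuple[Any, Dict]]]):
        self.generate_examples_fn = generate_examples_fn

    def __iter__(self):
        # yield (key, example) pairs straight from the underlying generator
        yield from self.generate_examples_fn()

class MappedExamplesIterable(ExamplesIterable):
    def __init__(self, ex_iterable: ExamplesIterable, function: Callable[[Dict], Dict]):
        self.ex_iterable = ex_iterable
        self.function = function

    def __iter__(self):
        # apply the map function on the fly, one example at a time
        for key, example in self.ex_iterable:
            yield key, self.function(example)
```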
### DatasetBuilder
I use the same builders as usual. I just added a new method `_get_examples_iterable_for_split` to get an ExamplesIterable for a given split. Currently only the GeneratorBasedBuilder and the ArrowBasedBuilder implement it.
The BeamBasedBuilder doesn't implement it yet, which means that datasets like wikipedia and natural_questions can't be loaded as an IterableDataset for now.
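For illustration, a GeneratorBasedBuilder could implement the new method by lazily wrapping its existing `_generate_examples` generator (a hypothetical sketch reusing the `ExamplesIterable` class sketched above):
```python
class MyTextDatasetBuilder:  # stand-in for a GeneratorBasedBuilder subclass
    def _generate_examples(self, filepath):
        # the usual example generator that builders already define
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}

    def _get_examples_iterable_for_split(self, split_generator):
        # wrap the generator lazily instead of materializing an Arrow table
        return ExamplesIterable(
            lambda: self._generate_examples(**split_generator.gen_kwargs)
        )
```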
## Other details
~~I may have to make changes in many dataset scripts to use `download` instead of `download_and_extract` when extraction is not needed. This will avoid errors during streaming.~~
EDIT: Actually I just check the extension of the file and do extraction only if needed.
EDIT2: It's not possible to stream from .tar.gz files without downloading the file completely. For now I raise an error if one wants to get a streaming dataset based on .tar.gz files.
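The extension check could be as simple as this sketch (the function name and details are assumptions):
```python
def prepare_for_streaming(url: str) -> str:
    if url.endswith((".tar.gz", ".tgz")):
        # tar.gz members can't be read without downloading the whole archive
        raise NotImplementedError(f"streaming from {url} is not supported")
    if url.endswith(".zip"):
        # only archives need the chained form; the caller adds the member path
        return f"zip://::{url}"
    return url  # plain files are streamed directly, no extraction needed
```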
## TODO
usual stuff:
- [x] make streaming dependency "aiohttp" optional: `pip install datasets[streaming]`
- [x] tests
- [x] docs
@lhoestq might be able to provide more info
Thanks for the info. Such a time saver.
This changes everything. Can't wait @lhoestq