Hey @lhoestq ,
Is there a roadmap/timeline for dataset streaming?
Cheers
hey @vblagoje even better, there’s an open PR: Dataset Streaming by lhoestq · Pull Request #2375 · huggingface/datasets · GitHub
Excellent, thanks @lewtun. So are we talking 1.9 as the most likely release, or 2.0?
i’ll let quentin answer that since he’s the master of all datasets releases
Hi ! Most likely in 1.9 (next week maybe ?)
It adds a new `datasets.IterableDataset` object that you can load by passing `streaming=True` in `load_dataset`. You can iterate over it using a `for` loop for example.
You can use it to load your dataset, define your data processing using `map` and `shuffle`, and to train models.
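To give an idea of what shuffling means here: a streamed dataset cannot be globally permuted, so shuffling works over a fixed-size buffer instead. Below is a minimal pure-Python sketch of that buffer mechanism, just to illustrate the idea (`stream_shuffle` is an illustrative helper, not the datasets API):

```python
import random

def stream_shuffle(examples, buffer_size, seed=42):
    """Approximate shuffling over a stream: keep a fixed-size buffer and,
    once it is full, yield a random buffered element each time a new
    example arrives. The full dataset never has to fit in memory."""
    rng = random.Random(seed)
    buffer = []
    for ex in examples:
        if len(buffer) < buffer_size:
            buffer.append(ex)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = ex
    # flush what remains in the buffer, in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(stream_shuffle(range(10), buffer_size=4))
```

A larger buffer gives shuffling closer to a true permutation, at the cost of memory.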
After the 1.9 release we’ll also keep developing more features around streaming. For example, supporting more of the `datasets.Dataset` methods, such as `filter`. We’ll also improve the caching/buffering mechanism, as well as streaming from compressed data.
Hi,
I have a general question regarding streaming mode: is IterableDataset not meant to be used with the PyTorch DataLoader? I can use Dataset with the DataLoader without any issues (as is also mentioned in the examples), but I cannot do so with IterableDataset. I am quite new to the HF Datasets library, so my apologies if this is already covered somewhere (I am still looking).
I get the following error, which makes sense because this is streaming mode, but I am unclear about how to design my code so that I can do batching:
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()
Any help is appreciated. Thank you.
Hi Leena,
If you are using the HF Trainer, look into overriding `get_train_dataloader`, where you can provide your own instance of a torch Sampler initialized with the dataset size.
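A sketch of what that override could look like, assuming the HF Trainer API (`StreamingTrainer` is an illustrative name, and skipping the sampler entirely is one possible variant; see the Trainer docs for the exact hooks):

```python
from torch.utils.data import DataLoader
from transformers import Trainer

class StreamingTrainer(Trainer):
    """Sketch: avoid the default length-based sampler for a streaming dataset."""

    def get_train_dataloader(self):
        # The default implementation builds a sampler from len(self.train_dataset),
        # which a streaming dataset does not have. Build the DataLoader without
        # a sampler instead, so batches are drawn in stream order.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
        )
```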
HTH,
Vladimir
Hi Vladimir,
Thank you for your response. I am not using the HF Trainer; I have my own trainer, which essentially means I create my own Dataset, DataLoader, training loop, etc. (I have not used the HF Trainer until now; maybe I will explore it.) If my understanding is right, `get_train_dataloader` returns a torch DataLoader, so working with the HF Trainer and with my own trainer should be very similar.
Looking at the PyTorch code, it seems that when an IterableDataset is used, the sampler is set to `_InfiniteConstantSampler`. I created a custom IterableDataset in torch (not HF) and I am able to create batches with it. When I create an HF IterableDataset (using `streaming=True`), I am not able to iterate over it: it throws an error, and you can see that the sampler is a `SequentialSampler`.
Here is a reproducible example:
from torch.utils.data import DataLoader, IterableDataset

class CustomIterableDataset(IterableDataset):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

data = list(range(12))
dataset = CustomIterableDataset(data)
dataloader = DataLoader(dataset, batch_size=4)
print("dataloader: ", dataloader.sampler)
for batch in dataloader:
    print(batch)
Output is:
dataloader: <torch.utils.data.dataloader._InfiniteConstantSampler object at 0x7f1cc29e2c50>
tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
tensor([ 8, 9, 10, 11])
import torch
from datasets import load_dataset

dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
print(dataloader.sampler)
# <torch.utils.data.sampler.SequentialSampler object at 0x7f245a510208>
for batch in dataloader:
    print(batch)
Error is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()
Thanks,
Leena
Solution at Error iteration over IterableDataset using Torch DataLoader · Issue #2583 · huggingface/datasets · GitHub
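For readers who don’t follow the link: a workaround in that spirit is to wrap the streaming dataset in a torch IterableDataset, so the DataLoader takes the iterable-style (sampler-free) code path instead of trying to build a length-based sampler. A sketch, demonstrated here with a plain list (`TorchIterableWrapper` is an illustrative name; see the linked issue for the exact resolution):

```python
from torch.utils.data import DataLoader, IterableDataset

class TorchIterableWrapper(IterableDataset):
    """Wrap any iterable (e.g. a datasets.IterableDataset loaded with
    streaming=True) so the DataLoader recognizes it as iterable-style
    and does not try to call len() on it."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)

# With Hugging Face streaming you would pass the object returned by
# load_dataset(..., streaming=True); a list stands in for it here.
wrapped = TorchIterableWrapper(list(range(8)))
loader = DataLoader(wrapped, batch_size=4)
batches = [batch.tolist() for batch in loader]
```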
Thank you for the solution and the link. I was just about to paste it here. This has been solved.