Roadmap/timeline for dataset streaming

Hey @lhoestq ,

Is there a roadmap/timeline for dataset streaming?

Cheers

hey @vblagoje, even better, there’s an open PR: Dataset Streaming by lhoestq · Pull Request #2375 · huggingface/datasets · GitHub

Excellent, thanks @lewtun so we are talking 1.9 as the most likely release or 2.0?

I’ll let Quentin answer that since he’s the master of all datasets releases :wink:

Hi! Most likely in 1.9 (next week, maybe?)

It adds a new datasets.IterableDataset object that you can get by passing streaming=True to load_dataset. You can iterate over it with a for loop, for example.

You can use it to load your dataset, define your data processing with map and shuffle, and train models.
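
The basic usage pattern described above can be sketched as follows. The real OSCAR call is shown only as a comment (it needs network access); the stand-in generator below just mimics what matters for iteration, namely an iterable of example dicts:

```python
from itertools import islice

# Real streaming usage (requires network access) would look like:
#   from datasets import load_dataset
#   ds = load_dataset("oscar", "unshuffled_deduplicated_en",
#                     split="train", streaming=True)

# Stand-in: for iteration purposes, a streaming dataset behaves like
# any iterable of example dicts.
def toy_streaming_dataset():
    for i in range(100):
        yield {"id": i, "text": f"example {i}"}

ds = toy_streaming_dataset()

# Iterate with a plain for loop, or take just the first few examples:
first_three = list(islice(ds, 3))
for example in first_three:
    print(example["text"])
```

The key point is that nothing is downloaded or materialized up front; examples are produced one at a time as you iterate.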

After the 1.9 release we’ll keep developing more features around streaming. For example:

  • a nice integration with pytorch/tensorflow/jax
  • a conversion method to get back a map-style datasets.Dataset object.
  • additional methods like filter

We’ll also improve the caching/buffering mechanism, as well as streaming from compressed data.

Hi,

I have a general question regarding streaming mode – is IterableDataset not meant to be used with the PyTorch DataLoader? I can use Dataset with the DataLoader without any issues (as is also shown in the examples), but I cannot do so with IterableDataset. I am quite new to the HF Datasets library, so my apologies if this is already mentioned somewhere (I am still looking).

I get the following error, which makes sense because this is streaming mode, but I am unclear about how to design my code so that I can still do batching:

File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()

Any help is appreciated. Thank you.

Hi Leena,

If you are using the HF Trainer, look into overriding get_train_dataloader, where you can provide your own torch Sampler instance initialized with the dataset size.
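
A minimal sketch of that override, assuming transformers is installed. StreamingTrainer is a hypothetical name, and instead of supplying a sized sampler this version simply omits the sampler, which is what a torch DataLoader needs when the dataset is iterable:

```python
from torch.utils.data import DataLoader
from transformers import Trainer

class StreamingTrainer(Trainer):
    # Override the default dataloader construction, which otherwise
    # builds a sampler that calls len() on the dataset.
    def get_train_dataloader(self):
        # With no sampler, the DataLoader iterates the dataset
        # directly, which is what a streaming/iterable dataset needs.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
        )
```

This is only one way to do it; if you have a known dataset size, passing your own Sampler as suggested above is the other option.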

HTH,
Vladimir

Hi Vladimir,

Thank you for your response. I am not using the HF Trainer; I have my own trainer, which essentially means I create my own Dataset, DataLoader, training loop, etc. (I have not used the HF Trainer so far; maybe I will explore it). If my understanding is right, get_train_dataloader returns a torch DataLoader, so working with the HF Trainer and my own trainer should be very similar.

Looking at the PyTorch code, it seems that when an IterableDataset is used, the sampler is set to _InfiniteConstantSampler. I created a custom IterableDataset in torch (not HF) and am able to create batches with it. But when I create an HF IterableDataset (using streaming=True), I am not able to iterate over it: it throws an error, and you can see that the sampler is SequentialSampler.

Here is a reproducible example:

  1. Using torch IterableDataset and torch DataLoader:

from torch.utils.data import DataLoader, IterableDataset

class CustomIterableDataset(IterableDataset):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

data = list(range(12))
dataset = CustomIterableDataset(data)
dataloader = DataLoader(dataset, batch_size=4)
print("dataloader: ", dataloader.sampler)
for batch in dataloader:
    print(batch)

Output is:
dataloader: <torch.utils.data.dataloader._InfiniteConstantSampler object at 0x7f1cc29e2c50>
tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
tensor([ 8,  9, 10, 11])

  2. Using HF IterableDataset and torch DataLoader:

import torch
from datasets import load_dataset

dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
print(dataloader.sampler)
# <torch.utils.data.sampler.SequentialSampler object at 0x7f245a510208>
for batch in dataloader:
    print(batch)

Error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()

Thanks,
Leena

Solution at Error iteration over IterableDataset using Torch DataLoader · Issue #2583 · huggingface/datasets · GitHub
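
For anyone landing here later: one way to make a torch DataLoader accept such a dataset is to wrap it in a torch IterableDataset subclass, so the DataLoader takes the sampler-free code path and never calls len(). A hedged sketch of that idea (TorchIterableWrapper is a hypothetical name, and a plain generator stands in for the HF streaming dataset; see the linked issue for the actual resolution):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class TorchIterableWrapper(IterableDataset):
    """Wrap any iterable of example dicts so DataLoader treats it
    as iterable-style (uses _InfiniteConstantSampler, no len())."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)

# Stand-in for a streaming dataset: a generator of example dicts.
examples = ({"x": i} for i in range(8))

loader = DataLoader(TorchIterableWrapper(examples), batch_size=4)
batches = list(loader)  # default collate turns each batch into a dict of tensors
```

Note that a generator is exhausted after one pass, so for multi-epoch training you would want the wrapper to re-open the underlying stream in __iter__.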

Thank you for the solution and the link. I was just about to paste it here. This has been solved.