Hey @lhoestq ,
Is there a roadmap/timeline for dataset streaming?
Cheers
hey @vblagoje even better, there’s an open PR: Dataset Streaming by lhoestq · Pull Request #2375 · huggingface/datasets · GitHub
Excellent, thanks @lewtun. So are we talking 1.9 as the most likely release, or 2.0?
i’ll let quentin answer that since he’s the master of all datasets releases
Hi ! Most likely in 1.9 (next week maybe ?)
It adds a new `datasets.IterableDataset` object that you can load by passing `streaming=True` in `load_dataset`. You can iterate over it using a `for` loop for example.
You can use it to load your dataset, define your data processing using `map` and `shuffle`, and to train models.
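To give an idea of what shuffling means here: a streamed dataset cannot be globally permuted, so shuffling works over a fixed-size buffer instead. Below is a minimal pure-Python sketch of that buffer mechanism, just to illustrate the idea (`stream_shuffle` is an illustrative helper, not the datasets API):

```python
import random

def stream_shuffle(examples, buffer_size, seed=42):
    """Approximate shuffling over a stream: keep a fixed-size buffer and,
    once it is full, yield a random buffered element each time a new
    example arrives. The full dataset never has to fit in memory."""
    rng = random.Random(seed)
    buffer = []
    for ex in examples:
        if len(buffer) < buffer_size:
            buffer.append(ex)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = ex
    # flush what remains in the buffer, in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(stream_shuffle(range(10), buffer_size=4))
```

A larger buffer gives shuffling closer to a true permutation, at the cost of memory.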
After the 1.9 release we’ll also keep developing more features around streaming. For example, supporting more of the `datasets.Dataset` methods, such as `filter`. We’ll also improve the caching/buffering mechanism, as well as streaming from compressed data.
Hi,
I have a general question regarding streaming mode: is IterableDataset not meant to be used with the PyTorch DataLoader? I can use Dataset with the DataLoader without any issues (as is also mentioned in the examples), but I cannot do so with IterableDataset. I am quite new to the HF Datasets library, so my apologies if this is already covered somewhere (I am still looking).
I get the following error, which makes sense because this is streaming mode, but I am unclear about how to design my code so that I can do batching:
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()
Any help is appreciated. Thank you.
Hi Leena,
If you are using the HF Trainer, look into overriding `get_train_dataloader`, where you can provide your own instance of a torch Sampler initialized with the dataset size.
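A sketch of what that override could look like, assuming the HF Trainer API (`StreamingTrainer` is an illustrative name, and skipping the sampler entirely is one possible variant; see the Trainer docs for the exact hooks):

```python
from torch.utils.data import DataLoader
from transformers import Trainer

class StreamingTrainer(Trainer):
    """Sketch: avoid the default length-based sampler for a streaming dataset."""

    def get_train_dataloader(self):
        # The default implementation builds a sampler from len(self.train_dataset),
        # which a streaming dataset does not have. Build the DataLoader without
        # a sampler instead, so batches are drawn in stream order.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
        )
```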
HTH,
Vladimir
Hi Vladimir,
Thank you for your response. I am not using the HF Trainer; I have my own trainer, which essentially means I create my own Dataset, DataLoader, training loop, etc. (I have not used the HF Trainer until now; maybe I will explore it.) If my understanding is right, `get_train_dataloader` returns a torch DataLoader, so working with the HF Trainer and with my own trainer should be very similar.
Looking at the PyTorch code, it seems that when an IterableDataset is used, the sampler is set to `_InfiniteConstantSampler`. I created a custom IterableDataset in torch (not HF) and I am able to create batches with it. When I create an HF IterableDataset (using `streaming=True`), I am not able to iterate over it: it throws an error, and you can see that the sampler is a `SequentialSampler`.
Here is a reproducible example:
from torch.utils.data import DataLoader, IterableDataset

class CustomIterableDataset(IterableDataset):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

data = list(range(12))
dataset = CustomIterableDataset(data)
dataloader = DataLoader(dataset, batch_size=4)
print("dataloader: ", dataloader.sampler)
for batch in dataloader:
    print(batch)
Output is:
dataloader: <torch.utils.data.dataloader._InfiniteConstantSampler object at 0x7f1cc29e2c50>
tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
tensor([ 8, 9, 10, 11])
import torch
from datasets import load_dataset

dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
print(dataloader.sampler)
# <torch.utils.data.sampler.SequentialSampler object at 0x7f245a510208>
for batch in dataloader:
    print(batch)
Error is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/data/leshekha/lib/HFDatasets/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 67, in __iter__
    return iter(range(len(self.data_source)))
TypeError: object of type 'IterableDataset' has no len()
Thanks,
Leena
Solution at Error iteration over IterableDataset using Torch DataLoader · Issue #2583 · huggingface/datasets · GitHub
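For readers who don’t follow the link: a workaround in that spirit is to wrap the streaming dataset in a torch IterableDataset, so the DataLoader takes the iterable-style (sampler-free) code path instead of trying to build a length-based sampler. A sketch, demonstrated here with a plain list (`TorchIterableWrapper` is an illustrative name; see the linked issue for the exact resolution):

```python
from torch.utils.data import DataLoader, IterableDataset

class TorchIterableWrapper(IterableDataset):
    """Wrap any iterable (e.g. a datasets.IterableDataset loaded with
    streaming=True) so the DataLoader recognizes it as iterable-style
    and does not try to call len() on it."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)

# With Hugging Face streaming you would pass the object returned by
# load_dataset(..., streaming=True); a list stands in for it here.
wrapped = TorchIterableWrapper(list(range(8)))
loader = DataLoader(wrapped, batch_size=4)
batches = [batch.tolist() for batch in loader]
```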
Thank you for the solution and the link. I was just about to paste it here. This has been solved.