Roadmap/timeline for dataset streaming

Hi! Most likely in 1.9 (next week, maybe?)

It adds a new datasets.IterableDataset object that you can load by passing streaming=True to load_dataset. You can then iterate over it, for example with a for loop.

You can use it to load your dataset, define your data processing using map and shuffle, and to train models.
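To give a rough idea of what "iterable" means here, below is a small stdlib-only sketch. It is not the datasets implementation: ToyStream, read_examples and the buffer-based shuffle are hypothetical names and choices made up for illustration. The real object is the one you get from load_dataset with streaming=True. The sketch only shows the general pattern: examples are produced on demand, map is applied lazily while iterating, and shuffle works on a fixed-size buffer rather than the whole dataset.

```python
import random
from typing import Callable, Iterable, Iterator


class ToyStream:
    """Hypothetical stand-in for an iterable dataset: examples are produced on demand."""

    def __init__(self, source: Callable[[], Iterable[dict]]):
        # The source is called again on every pass over the data.
        self._source = source

    def __iter__(self) -> Iterator[dict]:
        return iter(self._source())

    def map(self, fn: Callable[[dict], dict]) -> "ToyStream":
        # Nothing is processed here; fn only runs while someone iterates.
        return ToyStream(lambda: (fn(example) for example in self))

    def shuffle(self, buffer_size: int, seed: int = 0) -> "ToyStream":
        # Approximate shuffling (an assumption for this sketch): keep a small
        # buffer and swap a random buffered example out for each incoming one.
        def shuffled() -> Iterator[dict]:
            rng = random.Random(seed)
            buffer = []
            for example in self:
                if len(buffer) < buffer_size:
                    buffer.append(example)
                    continue
                i = rng.randrange(buffer_size)
                yield buffer[i]
                buffer[i] = example
            # Flush whatever is left once the source is exhausted.
            rng.shuffle(buffer)
            yield from buffer

        return ToyStream(shuffled)


def read_examples() -> Iterator[dict]:
    # Pretend these come from a remote file that is read piece by piece.
    for i in range(10):
        yield {"text": f"example {i}"}


stream = (
    ToyStream(read_examples)
    .map(lambda example: {**example, "n_chars": len(example["text"])})
    .shuffle(buffer_size=4)
)

for example in stream:
    print(example)
```

Because nothing is materialized up front, only the shuffle buffer is held in memory at any time, which is the point of streaming for datasets that don't fit on disk or in RAM.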

After the 1.9 release we’ll keep developing more features around streaming. For example:

  • a nice integration with PyTorch/TensorFlow/JAX
  • a conversion method to get back a map-style datasets.Dataset object.
  • additional methods like filter

We also plan to improve the caching/buffering mechanism, as well as streaming from compressed data.
