Hi! Most likely in 1.9 (next week maybe?).
It adds a new `datasets.IterableDataset` object that you can load by passing `streaming=True` to `load_dataset`. You can iterate over it using a `for` loop, for example.
You can use it to load your dataset, define your data processing using `map` and `shuffle`, and train models.
After the 1.9 release we'll keep developing more features around streaming. For example:
- a nice integration with pytorch/tensorflow/jax
- a conversion method to get back a map-style `datasets.Dataset` object
- additional methods like `filter`
We also plan to improve the caching/buffering mechanism as well as streaming from compressed data.