Iterable datasets features


I’m new here, so I don’t know whether this question has already been answered.

I’m loading a dataset with streaming=True and I see that the dataset features are None. Is this expected behaviour?

import datasets

ds = datasets.load_dataset("app_reviews", split="train", streaming=True)
print(ds.features)  # None in my environment

Thanks in advance!

Hi! When I run your code with datasets 1.18.4, I get:

{
  'package_name': Value(dtype='string', id=None),
  'review': Value(dtype='string', id=None),
  'date': Value(dtype='string', id=None),
  'star': Value(dtype='int8', id=None)
}

However, the features can sometimes be None when you load a dataset in streaming mode.

For example, if you load some CSV data in streaming mode, you actually need to start streaming some data in order to infer the type of each column. load_dataset currently doesn’t prefetch the data to infer the types in streaming mode (though we’re discussing it here: IterableDataset columns and feature types · Issue #3888 · huggingface/datasets · GitHub).

Prefetching can take a few seconds and maybe more if your dataset has lots of NaNs, because in this case you would even have to stream the dataset until you find a non-null example in order to infer the feature types.
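To make the cost of prefetching concrete, here is a minimal sketch of inferring column types from a stream of rows. It is not how the datasets library does it: infer_column_types is a hypothetical helper, it treats both None and float NaN as null, and it reports plain Python type names rather than datasets Value types. A column stays unresolved until a non-null value shows up, which is why null-heavy data forces you to read further into the stream.

```python
import math
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def is_null(value: Any) -> bool:
    """Treat None and float NaN as missing values (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_column_types(
    rows: Iterable[Dict[str, Any]], max_rows: int = 1000
) -> Dict[str, Optional[str]]:
    """Hypothetical helper: guess a Python type name per column from streamed rows.

    A column maps to None until a non-null value has been seen for it.
    At most max_rows rows are consumed from the stream.
    """
    inferred: Dict[str, Optional[str]] = {}
    for row in islice(rows, max_rows):
        for name, value in row.items():
            inferred.setdefault(name, None)
            if inferred[name] is None and not is_null(value):
                inferred[name] = type(value).__name__
        # Stop streaming as soon as every column seen so far has a type.
        if inferred and all(t is not None for t in inferred.values()):
            break
    return inferred


# Made-up rows: "review" is null until the third example.
rows = [
    {"review": None, "star": 4},
    {"review": float("nan"), "star": 5},
    {"review": "Great app", "star": 3},
]
types = infer_column_types(iter(rows))
print(types)
```

Because iterating over a streaming dataset yields one dict per example, the same function should also accept the ds object from the question, at the cost of downloading the first few examples.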

I hope that answers the question, and let me know if you have an opinion regarding prefetching to infer the feature types in streaming mode :)

Thanks for the response. I’ve tested with a new clean environment and it works.

I understand the need to start streaming data to infer the types, and I think that approach fits better. It might be nice to show a warning message, with some information about how to infer the features, when they are still None for an IterableDataset.

Again, thanks for your help!
