Iterable datasets features

Hi,

I'm new here and I don't know if this question has already been resolved.

I'm using a dataset with streaming=True and I see the dataset features are None. Is this expected behaviour?

import datasets

ds = datasets.load_dataset("app_reviews", split="train", streaming=True)
print(ds.features)

Thanks in advance!

Hi ! When running your code with datasets 1.18.4 I get

{
  'package_name': Value(dtype='string', id=None),
  'review': Value(dtype='string', id=None),
  'date': Value(dtype='string', id=None),
  'star': Value(dtype='int8', id=None)
}

However it can sometimes happen that the features are None when you load a dataset in streaming mode.

For example, if you load some CSV data in streaming mode, then you actually need to start streaming some data in order to infer the type each column is going to have. load_dataset currently doesn't prefetch the data to infer the types in streaming mode (though we're discussing it here: IterableDataset columns and feature types · Issue #3888 · huggingface/datasets · GitHub)

Prefetching can take a few seconds and maybe more if your dataset has lots of NaNs, because in this case you would even have to stream the dataset until you find a non-null example in order to infer the feature types.
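To illustrate why nulls make prefetching slower, here is a minimal pure-Python sketch of the idea. It is not what datasets does internally: infer_column_types is a hypothetical helper, None stands in for a null value, and the column names are borrowed from the app_reviews output above.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def infer_column_types(
    examples: Iterable[Dict[str, Any]],
    max_examples: Optional[int] = None,
) -> Tuple[Dict[str, Optional[str]], int]:
    """Read examples until every column seen so far has a non-null value.

    Returns the inferred type name per column (None if still unknown)
    and the number of examples that had to be read.
    """
    inferred: Dict[str, Optional[str]] = {}
    n_read = 0
    for n_read, example in enumerate(examples, start=1):
        for column, value in example.items():
            # Only fill in columns whose type is still unknown.
            if inferred.get(column) is None:
                inferred[column] = None if value is None else type(value).__name__
        # Stop as soon as every column has a known type.
        if all(t is not None for t in inferred.values()):
            break
        # Optional safety limit so we never stream the whole dataset.
        if max_examples is not None and n_read >= max_examples:
            break
    return inferred, n_read


# Toy stream: the "star" column is null in the first two examples.
stream = iter([
    {"review": "great app", "star": None},
    {"review": "crashes", "star": None},
    {"review": None, "star": 4},
    {"review": "ok", "star": 5},
])

types, n_read = infer_column_types(stream)
print(types, n_read)
```

Here three examples have to be read before the type of "star" is known, and the fourth is left untouched in the stream. With a column that is mostly null, the same loop would have to read much further.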

I hope that answers the question, and let me know if you have an opinion regarding prefetching to infer the feature types in streaming mode :slight_smile:

Thanks for the response. I've tested in a new, clean environment and it works.

I understand the need to start streaming data to infer the types, and I think that approach fits better. Maybe it would be nice to show a warning message with some info about how to infer the features when they are still None for an IterableDataset.

Again, thanks for your help!


Hello, I have the same problem I think.

data = load_dataset("json", data_files=dataset, split="train", streaming=True)
data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))

fails with

File "test_predict.py", line 104, in main
data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))
File "/home/ugo/miniconda3/envs/audio/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1191, in cast_column
info.features[column] = feature
TypeError: 'NoneType' object does not support item assignment

@lhoestq Could you describe what you mean by prefetching data ?


Hi ! You can fix this by passing the features= argument to load_dataset with the type of all the columns

What I mean by prefetching is that you need to download the first examples of the dataset to know what type each column is

Ok, Thanks ! makes sense.

For other people in the same situation, here is what I did to build the features object. I loaded a smaller version of the same dataset without streaming=True, printed dataset.features, and copy-pasted the output into the initialization of a new Features() object, which I then passed to load_dataset with streaming=True. Something like this:

print(smalldataset.features) # {'path': Value(dtype='string', id=None), ...}
features = Features({'path': Value(dtype='string', id=None), ...})
dataset = load_dataset(... , features=features, streaming=True)
