Iterable datasets features

frascuchon · March 10, 2022, 1:51pm

Hi,

I’m new here and I don’t know if this question has already been answered.

I’m using a dataset with streaming=True and I see that the dataset features are None. Is this expected behaviour?

import datasets

ds = datasets.load_dataset("app_reviews", split="train", streaming=True)
print(ds.features)

Thanks in advance!

lhoestq · March 11, 2022, 12:34pm

Hi ! When running your code with datasets 1.18.4 I get

{
  'package_name': Value(dtype='string', id=None),
  'review': Value(dtype='string', id=None),
  'date': Value(dtype='string', id=None),
  'star': Value(dtype='int8', id=None)
}

However it can sometimes happen that the features are None when you load a dataset in streaming mode.

For example, if you load some CSV data in streaming mode, then you actually need to start streaming some data in order to infer the type each column is going to have. load_dataset currently doesn’t prefetch the data to infer the types in streaming mode (though we’re discussing it here: IterableDataset columns and feature types · Issue #3888 · huggingface/datasets · GitHub).

Prefetching can take a few seconds and maybe more if your dataset has lots of NaNs, because in this case you would even have to stream the dataset until you find a non-null example in order to infer the feature types.
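
As a rough illustration of why null-heavy data makes this slower, here is a minimal pure-Python sketch of the idea. It is not how load_dataset is implemented: the function name infer_column_types, the max_examples cap, and the use of Python type names instead of real feature types are all assumptions made for the example, and only None is treated as null.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def infer_column_types(
    examples: Iterable[Dict[str, Any]], max_examples: int = 1000
) -> Dict[str, Optional[str]]:
    """Read examples until every column seen so far has a non-null value.

    Hypothetical helper: returns a mapping from column name to a Python type
    name, or None if the column was null in every example that was read.
    """
    types: Dict[str, Optional[str]] = {}
    for example in islice(examples, max_examples):
        for column, value in example.items():
            # Only fill in columns whose type is still unknown.
            if types.get(column) is None:
                types[column] = None if value is None else type(value).__name__
        # Stop as soon as no column is left without a type.
        if types and all(t is not None for t in types.values()):
            break
    return types


# Toy stream: "review" is null in the first two examples, so three
# examples have to be read before its type is known.
stream = iter([
    {"review": None, "star": 5},
    {"review": None, "star": 4},
    {"review": "Great app", "star": 3},
    {"review": "unused", "star": 1},
])
inferred = infer_column_types(stream)
print(inferred)
```

Note that the examples read during inference are consumed from the iterator, so a real implementation would have to buffer them or restart the stream afterwards.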

I hope that answers the question, and let me know if you have an opinion regarding prefetching to infer the feature types in streaming mode.

frascuchon · March 11, 2022, 6:10pm

Thanks for the response. I’ve tested with a new clean environment and it works.

I understand the need to start streaming data to infer the types, and I think it fits better. It might be nice to show a warning message with some info about how to infer the features when they are still None for an IterableDataset.

Again, thanks for your help!

log0 · September 7, 2022, 7:19pm

Hello, I have the same problem I think.

data = load_dataset("json", data_files=dataset, split="train", streaming=True)
data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))

fails with

  File "test_predict.py", line 104, in main
    data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))
  File "/home/ugo/miniconda3/envs/audio/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1191, in cast_column
    info.features[column] = feature
TypeError: 'NoneType' object does not support item assignment

@lhoestq Could you describe what you mean by prefetching data ?

lhoestq · September 8, 2022, 12:32pm

Hi ! You can fix this by passing the features= argument to load_dataset with the types of all the columns.

What I mean by prefetching is that you need to download the first examples of the dataset to know what type each column is.

log0 · September 8, 2022, 5:04pm

Ok, Thanks ! makes sense.

For other people in the same situation, here is what I did to build the features object. I loaded the same dataset (a smaller version) without streaming=True, then printed dataset.features and copy-pasted the output into the initialization of a new Features() object, which I passed to load_dataset with streaming=True. Something like this:

from datasets import Features, Value, load_dataset

print(smalldataset.features)  # {'path': Value(dtype='string', id=None), ...}
features = Features({'path': Value(dtype='string', id=None), ...})
dataset = load_dataset(... , features=features, streaming=True)
