However, it can sometimes happen that the features are None when you load a dataset in streaming mode.
For example, if you load some CSV data in streaming mode, you actually need to start streaming some data in order to infer the type of each column. load_dataset currently doesn't prefetch the data to infer the types in streaming mode (though we're discussing it here: IterableDataset columns and feature types · Issue #3888 · huggingface/datasets · GitHub).
Prefetching can take a few seconds and maybe more if your dataset has lots of NaNs, because in this case you would even have to stream the dataset until you find a non-null example in order to infer the feature types.
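To make the idea of prefetching concrete, here is a minimal sketch in plain Python. It is not how the datasets library does it: `infer_column_types` and `max_examples` are hypothetical names, and it works on any iterable of dicts rather than on a real IterableDataset. It only illustrates why null-heavy columns force you to read further into the stream.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Optional


def infer_column_types(
    examples: Iterable[Dict[str, Any]], max_examples: int = 1000
) -> Dict[str, Optional[type]]:
    """Read streamed examples until every seen column has a non-null value.

    Hypothetical helper: a column maps to None if only nulls were seen
    within the first `max_examples` examples.
    """
    inferred: Dict[str, Optional[type]] = {}
    for example in islice(examples, max_examples):
        for column, value in example.items():
            # Treat both None and float NaN as "null".
            is_null = value is None or (isinstance(value, float) and value != value)
            if is_null:
                inferred.setdefault(column, None)
            elif inferred.get(column) is None:
                inferred[column] = type(value)
        # Stop as soon as every column seen so far has a known type.
        if inferred and all(t is not None for t in inferred.values()):
            break
    return inferred


rows = [
    {"text": None, "score": float("nan")},
    {"text": "hello", "score": None},
    {"text": "world", "score": 0.5},
]
print(infer_column_types(iter(rows)))
```

Here the third row has to be read before the type of "score" is known, which is the cost mentioned above. Note that this sketch consumes the examples it reads, so a real implementation would also have to hand them back to the stream.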
I hope that answers the question, and let me know if you have an opinion regarding prefetching to infer the feature types in streaming mode.
Thanks for the response. I've tested with a new clean environment and it works.
I understand the need to start streaming data to infer the types, and I think it fits better. Maybe it would be nice to show a warning message with some info about how to infer the features when they are still None for an IterableDataset.
data = load_dataset("json", data_files=dataset, split="train", streaming=True)
data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))
fails with
  File "test_predict.py", line 104, in main
    data.cast_column("audio", datasets.features.Audio(sampling_rate=16000))
  File "/home/ugo/miniconda3/envs/audio/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1191, in cast_column
    info.features[column] = feature
TypeError: 'NoneType' object does not support item assignment
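The traceback comes down to item assignment on a features attribute that is still None. A minimal sketch of that failure, and of the clearer error a guard could give, might look like the following. `StreamInfo` and `set_column_feature` are hypothetical stand-ins written for illustration, not the actual datasets internals.

```python
from typing import Any, Dict, Optional


class StreamInfo:
    """Hypothetical stand-in for the info object of a streamed dataset."""

    def __init__(self, features: Optional[Dict[str, Any]] = None) -> None:
        self.features = features


def set_column_feature(info: StreamInfo, column: str, feature: Any) -> None:
    """Assign a feature, failing with a clearer message if features are unknown."""
    if info.features is None:
        raise ValueError(
            f"Cannot cast column {column!r}: features are None. "
            "Pass features explicitly or infer them from the first examples."
        )
    info.features[column] = feature


unknown = StreamInfo()  # features not inferred yet, as in streaming mode
try:
    # Same kind of statement as in the traceback: item assignment on None.
    unknown.features["audio"] = "Audio(sampling_rate=16000)"
except TypeError as err:
    print(err)

known = StreamInfo(features={"audio": "string"})
set_column_feature(known, "audio", "Audio(sampling_rate=16000)")
print(known.features)
```

With known features the assignment goes through, which is why providing the features up front avoids the error.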
@lhoestq Could you describe what you mean by prefetching data?
For other people in the same situation, here is what I did to build the features object. I loaded the same dataset (a smaller version) without streaming=True, printed dataset.features, and copy-pasted the output into the initialization of a new Features() object that I passed to load_dataset with streaming=True. Something like this