I’m trying to load .hf datasets using streaming.
The dataset is tuan124816/newcs2_data:
from datasets import load_dataset

dataset = load_dataset("tuan124816/newcs2_data", streaming=True)
hf_dataset = dataset['test']
print(hf_dataset)
Output:
IterableDataset({
    features: Unknown,
    n_shards: 600
})
When I print the first element:
print(next(iter(hf_dataset)))
Output:
{'_data_files': [{'filename': 'data-00000-of-00001.arrow'}], '_fingerprint': '905978a8bab44335', '_format_columns': ['observations', 'actions', 'rewards'], '_format_kwargs': {}, '_format_type': None, '_output_all_columns': False, '_split': None}
Is this the right way to load this kind of dataset?
How can I read the data and see what is inside ['observations', 'actions', 'rewards']?
From what I can see, streaming only loads the state.json file from each .hf folder.
state.json:
{
"_data_files": [
{
"filename": "data-00000-of-00001.arrow"
}
],
"_fingerprint": "905978a8bab44335",
"_format_columns": [
"observations",
"actions",
"rewards"
],
"_format_kwargs": {},
"_format_type": null,
"_output_all_columns": false,
"_split": null
}
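To make it clearer what this file contains, here is a minimal sketch that parses the state.json above with the standard library only. The layout looks like the metadata that `Dataset.save_to_disk` writes next to its Arrow files, so the file holds pointers to the data, not the data itself. The helper name `summarize_state` is hypothetical and only for illustration.

```python
import json


def summarize_state(state_text):
    """Return (arrow file names, formatted column names) from a state.json string."""
    state = json.loads(state_text)
    # Each entry in _data_files names one Arrow file stored next to state.json.
    files = [entry["filename"] for entry in state["_data_files"]]
    # Fall back to an empty list if _format_columns is missing or null.
    columns = state.get("_format_columns") or []
    return files, columns


# The state.json content quoted in this thread.
STATE_JSON = """
{
  "_data_files": [{"filename": "data-00000-of-00001.arrow"}],
  "_fingerprint": "905978a8bab44335",
  "_format_columns": ["observations", "actions", "rewards"],
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
"""

files, columns = summarize_state(STATE_JSON)
print(files)
print(columns)
```

So the row returned by streaming is just this metadata record. The actual observations, actions and rewards live in the Arrow file it points to.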
In my experience, other datasets such as imdb always give a clear dictionary with text and label when streamed. I’m confused about why the Arrow file isn’t loaded here.
Am I doing something wrong?
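One thing that might be worth trying is to point `load_dataset` at the Arrow files explicitly, so the JSON metadata is never picked up as data. This is only a sketch under assumptions: that the repo keeps its data in `.arrow` files somewhere below the root, and that the installed `datasets` version recognizes the `.arrow` extension. The helper names `stream_arrow_files` and `peek` are hypothetical, and I have not run this against this repo.

```python
from itertools import islice


def stream_arrow_files(repo_id, pattern="**/*.arrow"):
    """Stream only the files matching `pattern` from a Hub dataset repo."""
    # Imported lazily so that peek() below works without `datasets` installed.
    from datasets import load_dataset

    # When data_files is a plain string, all matched files go to the "train" split.
    return load_dataset(repo_id, data_files=pattern, split="train", streaming=True)


def peek(rows, n=1):
    """Return the first n items of any iterable as a list."""
    return list(islice(rows, n))
```

With network access, `peek(stream_arrow_files("tuan124816/newcs2_data"))` should return the first row as a dictionary if the assumptions hold, and its keys would show which columns are really stored. Note that all files end up in a single "train" split here, so the original 'test' split name is not preserved.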
I fixed this. The cause was the dataset viewer on Hugging Face.
I don’t know why, but when I deleted it from the dataset, streaming worked fine again.
Ah, it’s probably a bug in the library.
There are reports that sound like they might be related, but I don’t think any fixes have been merged.
https://github.com/huggingface/datasets/pulls
Does the behavior of the dataset viewer affect the behavior of the dataset library?
Does it lock the file or something…