How to stream a .hf dataset

I’m trying to load .hf datasets in streaming mode.
The dataset is tuan124816/newcs2_data.

from datasets import load_dataset

dataset = load_dataset("tuan124816/newcs2_data",
                       streaming=True)
hf_dataset = dataset['test']

Output:

IterableDataset({
    features: Unknown,
    n_shards: 600
})

When I print the first element:

print(next(iter(hf_dataset)))

Output:

{'_data_files': [{'filename': 'data-00000-of-00001.arrow'}], '_fingerprint': '905978a8bab44335', '_format_columns': ['observations', 'actions', 'rewards'], '_format_kwargs': {}, '_format_type': None, '_output_all_columns': False, '_split': None}

Is this the right way to load this kind of dataset?
How can I read the data and find out what is inside ['observations', 'actions', 'rewards']?
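
Once the stream yields actual rows rather than metadata, one quick way to see what each column holds is to take the first example and summarize every field. This is only a minimal sketch: summarize_example and first_rows_summary are hypothetical helper names, not part of the datasets library, and the second function assumes the repo loads with load_dataset(..., streaming=True) as in the snippet above.

```python
from itertools import islice


def summarize_example(example):
    """Map each column name to a short description of its value."""
    summary = {}
    for name, value in example.items():
        kind = type(value).__name__
        # Report a length for sized containers, but not for strings.
        if hasattr(value, "__len__") and not isinstance(value, (str, bytes)):
            summary[name] = f"{kind}(len={len(value)})"
        else:
            summary[name] = kind
    return summary


def first_rows_summary(repo_id, split, n=1):
    """Stream a Hub dataset and summarize its first n rows (needs network)."""
    from datasets import load_dataset  # imported lazily, only when called

    stream = load_dataset(repo_id, streaming=True)[split]
    # islice stops after n rows, so the rest of the stream is never fetched.
    return [summarize_example(row) for row in islice(stream, n)]
```

Calling first_rows_summary("tuan124816/newcs2_data", "test") would then print one summary per row. If the keys that come back are things like '_data_files' and '_fingerprint' instead of the expected columns, the stream is reading metadata files rather than the Arrow data.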

This?

From what I can see, streaming only loads the state.json file from each .hf folder.

state.json:

{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "905978a8bab44335",
  "_format_columns": [
    "observations",
    "actions",
    "rewards"
  ],
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}

In my experience, other datasets such as imdb always give a clear dictionary with text and label when streamed. I’m confused about why the Arrow file isn’t loaded here.
Am I doing something wrong?

I fixed this. The cause was the dataset viewer on Hugging Face.
I don’t know why, but when I deleted it from the dataset, streaming worked fine again.

Ah, it’s probably a bug in the library.
There are pull requests that sound like they might be related, but I don’t think they’ve been merged.

https://github.com/huggingface/datasets/pulls

Does the behavior of the dataset viewer affect the behavior of the dataset library?
Does it lock the file or something…

Hi! Datasets with a state.json are datasets saved to local disk using save_to_disk() and are not supported by the Hub. However, if you remove the JSON file and keep only the Arrow files, it will work correctly.

We should change .save_to_disk() completely to save in a format that is supported by the Hub IMO
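
To find which folders in a repo were written by save_to_disk(), one option is to scan the file listing for state.json and collect the Arrow files sitting next to it. The following is a minimal sketch, not an official tool: split_saved_folders and list_dataset_files are hypothetical helper names, and the listing is assumed to come from huggingface_hub.list_repo_files.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_saved_folders(paths):
    """Group Arrow files by folder and flag folders that hold a state.json.

    Returns {folder: {"arrow": [...], "has_state": bool}}.
    """
    folders = defaultdict(lambda: {"arrow": [], "has_state": False})
    for p in paths:
        path = PurePosixPath(p)
        folder = str(path.parent)
        if path.name == "state.json":
            folders[folder]["has_state"] = True
        elif path.suffix == ".arrow":
            folders[folder]["arrow"].append(p)
    # Other files (README, dataset_info.json, ...) are ignored.
    return dict(folders)


def list_dataset_files(repo_id):
    """Fetch the file listing of a dataset repo (needs network access)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return list_repo_files(repo_id, repo_type="dataset")
```

For example, split_saved_folders(list_dataset_files("tuan124816/newcs2_data")) would show every folder where has_state is True. Those are the ones whose state.json is being picked up by the stream in place of the Arrow data.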
