How to stream a .hf dataset

xDokiXDokix · November 6, 2024, 1:45pm

I’m trying to load .hf datasets with streaming.
The dataset is tuan124816/newcs2_data.

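from datasets import load_dataset

# Stream the dataset from the Hub instead of downloading it in full
dataset = load_dataset("tuan124816/newcs2_data",
                       streaming=True)
hf_dataset = dataset['test']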

output

IterableDataset({
    features: Unknown,
    n_shards: 600
})

When I print the first element:

print(next(iter(hf_dataset)))

output

{'_data_files': [{'filename': 'data-00000-of-00001.arrow'}], '_fingerprint': '905978a8bab44335', '_format_columns': ['observations', 'actions', 'rewards'], '_format_kwargs': {}, '_format_type': None, '_output_all_columns': False, '_split': None}

Is this the right way to load this kind of dataset?
How can I read the data and see what is inside ['observations', 'actions', 'rewards']?

John6666 · November 6, 2024, 3:20pm

This?

xDokiXDokix · November 6, 2024, 3:54pm

From what I can see, streaming only loads the state.json file from each .hf folder.

state.json:

{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "905978a8bab44335",
  "_format_columns": [
    "observations",
    "actions",
    "rewards"
  ],
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}

In my experience, other datasets such as imdb give a clear dictionary with text and label when streamed. I’m confused about why the Arrow file isn’t loaded here.
Am I doing something wrong?

xDokiXDokix · November 10, 2024, 3:13pm

I fixed this. It was caused by the dataset viewer on Hugging Face.
I don’t know why, but when I deleted it from the dataset, streaming worked fine again.

John6666 · November 10, 2024, 3:34pm

Ah, it’s probably a bug in the library.
There are reports that sound like they might be related, but I don’t think the fixes have been merged.

https://github.com/huggingface/datasets/pulls

Does the behavior of the dataset viewer affect the behavior of the dataset library?
Does it lock the file or something…

lhoestq · November 30, 2024, 4:13pm

Hi! Datasets with a state.json are datasets saved to local disk using save_to_disk() and are not supported by the Hub. If you remove the JSON file and keep only the Arrow files, though, it will work correctly.
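To illustrate the point about state.json, here is a minimal standard-library sketch. It assumes a folder laid out like the state.json posted earlier in the thread. The helper name `arrow_files_from_state` and the folder name `example.hf` are hypothetical, and the Arrow file is an empty placeholder rather than real data. It only shows that state.json is metadata pointing at the Arrow files, which hold the actual rows.

```python
import json
import tempfile
from pathlib import Path


def arrow_files_from_state(folder):
    """Return the Arrow files that a save_to_disk() folder's state.json points to."""
    folder = Path(folder)
    state = json.loads((folder / "state.json").read_text())
    return [folder / entry["filename"] for entry in state["_data_files"]]


# Build a stand-in folder that mirrors the state.json shown in the thread.
demo_dir = Path(tempfile.mkdtemp()) / "example.hf"  # hypothetical folder name
demo_dir.mkdir()
state = {
    "_data_files": [{"filename": "data-00000-of-00001.arrow"}],
    "_fingerprint": "905978a8bab44335",
    "_format_columns": ["observations", "actions", "rewards"],
    "_format_kwargs": {},
    "_format_type": None,
    "_output_all_columns": False,
    "_split": None,
}
(demo_dir / "state.json").write_text(json.dumps(state))
# Empty placeholder so the path exists; a real folder holds Arrow data here.
(demo_dir / "data-00000-of-00001.arrow").touch()

arrow_paths = arrow_files_from_state(demo_dir)
print([p.name for p in arrow_paths])
```

The row that streaming returned in the first post is exactly this metadata dictionary, which suggests the JSON file was being read as if it were the data.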

We should change .save_to_disk() completely to save in a format that is supported by the Hub IMO
