Create HF dataset from h5

Hi.

I have an h5 file that consists of two datasets: one for metadata (labels, etc.) and one for the actual data, which is a 2D array for each element. From my experience working with datasets, HF's dataset has a very good caching mechanism and is usually much faster than a plain vanilla dataset I could write in pure Python. I'd like to know how I can create an HF dataset from this hdf5 file, and whether there are any examples for custom datasets.

The dataset loading script template has parts about downloading the data and split generators, which I don’t think are related to my current task.

Additionally, I also need to split the data based on conditions on the labels, not purely at random. How can I achieve that?

Hi,

feel free to create a feature request for loading a dataset from hdf5 files.

In the meantime, you can load a dataset using pandas as follows:

import pandas as pd
import datasets

dset = datasets.Dataset.from_pandas(pd.read_hdf(hdf5_file))

Note that this will load a dataset in memory. To use caching when applying transforms, the dataset has to be stored on disk, so do the following:

dset.save_to_disk(save_dir)               # save_to_disk returns None, so save ...
dset = datasets.load_from_disk(save_dir)  # ... and reload in a separate step

or, if you want to keep the dataset in memory, specify cache_file_name when calling transforms.
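For example, a minimal sketch (the "text" column and the length transform are just placeholders, not something from your data):

# hypothetical transform on a hypothetical "text" column
dset = dset.map(
    lambda ex: {"text_length": len(ex["text"])},
    cache_file_name="map_cache.arrow",  # results are written to (and reused from) this Arrow file
)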

If the dataset is too big to fit in memory when loading from the hdf5 file, try to load it in chunks:

import pandas as pd
import datasets

CHUNKSIZE = 10_000
SAVE_DIR = "dummy_dir"

dsets = []
for i, df_chunk in enumerate(pd.read_hdf(hdf5_file, iterator=True, chunksize=CHUNKSIZE)):
    dset = datasets.Dataset.from_pandas(df_chunk)
    dset.save_to_disk(f"{SAVE_DIR}-{i}")               # save_to_disk returns None, so save ...
    dset = datasets.load_from_disk(f"{SAVE_DIR}-{i}")  # ... and reload in a separate step
    dsets.append(dset)

dset = datasets.concatenate_datasets(dsets)

Let me know if this works for you.


Hi. The problem I have is that my h5 file is not compatible with pandas, since I created it with h5py and didn't follow the strict h5 format pandas requires for h5 files.

My h5 file consists of two datasets. One is windows, a numerical dataset of float32 windows of shape (18, 1024); there are around 1e6 windows, which makes the total shape (1e6, 18, 1024).

I have also collected the JSON strings of each data point's metadata in another dataset named metadata: the window's label, for example, which can be used for classification tasks later on.

Each index in the final dataset interface should read the data from both datasets at that index. Thanks!

Hi,

For the first dataset, try to load the data in chunks with Dataset.from_pandas if you can somehow convert the data to be pandas-compatible, or with Dataset.from_dict if you can load the data into a dict. Then, store each chunk to disk (Dataset.save_to_disk) and reload the chunk (datasets.load_from_disk). Next, concatenate the loaded chunks with datasets.concatenate_datasets (set the axis param to 0). Feel free to follow my snippet above, because the idea is the same.
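If it helps, here is a rough sketch of that idea using h5py and Dataset.from_dict; the file name "windows.h5", the "windows" key, the "window" column name, and the save directory are assumptions based on your description:

import h5py
import datasets

CHUNKSIZE = 10_000
SAVE_DIR = "windows_chunks"

dsets = []
with h5py.File("windows.h5", "r") as f:
    windows = f["windows"]  # shape (num_windows, 18, 1024)
    for i in range(0, len(windows), CHUNKSIZE):
        chunk = windows[i : i + CHUNKSIZE]
        # from_dict expects a mapping of column name -> list of values
        dset = datasets.Dataset.from_dict({"window": chunk.tolist()})
        dset.save_to_disk(f"{SAVE_DIR}-{i}")                       # write the chunk to disk ...
        dsets.append(datasets.load_from_disk(f"{SAVE_DIR}-{i}"))   # ... and reload it memory-mapped

windows_dataset = datasets.concatenate_datasets(dsets)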

For the second dataset, load the JSON strings with Dataset.from_json. This dataset will already be on disk, so you don’t have to worry about RAM usage.
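Since the JSON strings live inside the h5 file rather than in a JSON file, one possible approach (just a sketch; "windows.h5", the "metadata" key, and "metadata.jsonl" are assumptions based on your description) is to dump them to a JSON Lines file first and then load that:

import h5py
import datasets

# dump the per-window JSON strings to a JSON Lines file ...
with h5py.File("windows.h5", "r") as f, open("metadata.jsonl", "w") as out:
    for raw in f["metadata"]:
        # h5py may return bytes; decode if necessary
        out.write((raw.decode("utf-8") if isinstance(raw, bytes) else raw) + "\n")

# ... and load it with Dataset.from_json (the data ends up in the Arrow cache on disk)
metadata_dataset = datasets.Dataset.from_json("metadata.jsonl")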

Finally, concatenate these two datasets as follows:

datasets.concatenate_datasets([windows_dataset, metadata_dataset], axis=1)