I have an h5 file which consists of two datasets: one for metadata (labels, etc.) and one for the actual data, which is a 2D array for each element. From my experience working with datasets, HF's datasets library has a very good caching mechanism and is usually way faster than a plain vanilla dataset I could code in pure Python. I wanted to know how I can create an HF dataset from this hdf5 file, and whether there are any examples for custom datasets.
The dataset loading script template has parts about downloading the data and split generators, which I don’t think are related to my current task.
Additionally, I also have to split the data based on conditions on the labels, not purely randomly. How can I achieve that?
Hi. The problem I have is that my h5 file is not compatible with pandas, since I created it with h5py and didn't follow the strict h5 format pandas expects for h5 files.
My h5 file consists of two datasets. One is windows, a numerical dataset of float32 windows of shape (18, 1024); there are around 1e6 windows, which makes the total shape (1e6, 18, 1024).
I have also collected the JSON string of each datapoint's metadata in another dataset named metadata. It holds the window's label, for example, which can be used for classification tasks later on.
Each index in the final dataset interface should read the data from both datasets at that index. Thanks!
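For reference, a minimal sketch of how this file could be read with h5py, assuming the dataset names `windows` and `metadata` described above (the file name is illustrative):

```python
import h5py

# Hypothetical layout matching the description above: "windows" holds the
# float32 arrays of shape (N, 18, 1024) and "metadata" holds one JSON string
# per window.
with h5py.File("data.h5", "r") as f:
    windows = f["windows"]    # shape (N, 18, 1024), dtype float32
    metadata = f["metadata"]  # shape (N,), JSON strings
    print(windows.shape, metadata.shape)

    first_window = windows[0]   # reads only this window from disk
    first_meta = metadata[0]    # the corresponding JSON metadata string
```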
For the first dataset, try to load the data in chunks with Dataset.from_pandas if you can somehow convert the data to be pandas-compatible, or with Dataset.from_dict if you can load the data into a dict. Then, store each chunk to disk (Dataset.save_to_disk) and reload it (datasets.load_from_disk). Next, concatenate the loaded chunks with datasets.concatenate_datasets (set the axis param to 0). Feel free to follow my snippet above, since the idea is the same.
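A rough sketch of that chunked approach, assuming the h5py layout described earlier (the file name, chunk size, column name, and output directories are all illustrative):

```python
import h5py
import datasets
from datasets import Dataset

chunk_size = 10_000  # illustrative; tune to your RAM budget
chunk_dirs = []

with h5py.File("data.h5", "r") as f:
    windows = f["windows"]  # (N, 18, 1024) float32
    for start in range(0, len(windows), chunk_size):
        # Slicing an h5py dataset loads only this chunk into memory.
        chunk = windows[start:start + chunk_size]
        ds_chunk = Dataset.from_dict({"window": chunk.tolist()})
        chunk_dir = f"chunks/windows_{start}"
        ds_chunk.save_to_disk(chunk_dir)  # flush the chunk to disk
        chunk_dirs.append(chunk_dir)

# Reload the on-disk (memory-mapped) chunks and concatenate them row-wise.
loaded = [datasets.load_from_disk(d) for d in chunk_dirs]
windows_ds = datasets.concatenate_datasets(loaded, axis=0)
```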
For the second dataset, load the JSON strings with Dataset.from_json. This dataset will already be on disk, so you don’t have to worry about RAM usage.
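One way to do this, assuming the per-window JSON strings are first written out as a JSON Lines file (file names are illustrative, and each metadata string is assumed to fit on a single line):

```python
import h5py
from datasets import Dataset

# Dump the per-window JSON strings to a JSON Lines file,
# since Dataset.from_json reads from a file on disk.
with h5py.File("data.h5", "r") as f, open("metadata.jsonl", "w") as out:
    for raw in f["metadata"]:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        out.write(line + "\n")

metadata_ds = Dataset.from_json("metadata.jsonl")
```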
Finally, concatenate these two datasets as follows:
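A minimal sketch, reusing the `windows_ds` and `metadata_ds` names assumed in the snippets above:

```python
import datasets

# Column-wise concatenation: both datasets must have the same number of rows
# and non-overlapping column names.
final_ds = datasets.concatenate_datasets([windows_ds, metadata_ds], axis=1)
print(final_ds)  # one row per window, holding the window and its metadata columns
```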