I want to convert my custom dataset into a Hugging Face dataset by creating it with `from_generator` and then calling `save_to_disk`, so that I can `load_from_disk` later.
The generator function yields samples like the following:

```python
import numpy as np

# Each sample can have a different size for dimension 0,
# as mentioned in https://huggingface.co/docs/datasets/v3.0.2/en/about_dataset_features#dataset-features
# where it is suggested to use None, e.g.:
# features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
def generatorFn():
    ...
    sample = {
        "example": np.ones([4, 5]),   # 2D numpy array
        "label": np.ones([10, 5]),    # 2D numpy array; dim 0 may differ per sample
        "numbers": np.ones([7]),      # 1D numpy array
        "has_noise": True,            # bool
    }
    yield sample
```
What would the recommended types in the features be? `Sequence`, or `Array2D` + `Array1D`? And how do I make sure I get back the correct shape and dtype when loading the dataset?
Additional info: I have two large files, one which stores examples as rows and another which stores labels along with some metadata as rows. In the example file, each row consists of 2D coordinates of dtype float32, saved in the format `12.23,2332.343:54.45,767.767:...`, where a comma separates the values within a 2D coordinate and a colon separates consecutive coordinates.
I have some processor functions to convert each row into a 2D numpy array, and some other processor functions to read rows of the label file, which contains 2D coordinates along with some metadata.
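To illustrate, one of my processor functions for the example file looks roughly like this (the function name is just for illustration):

```python
import numpy as np

def parse_example_row(row: str) -> np.ndarray:
    """Parse e.g. '12.23,2332.343:54.45,767.767' into a float32 array of shape (N, 2)."""
    pairs = [coord.split(",") for coord in row.strip().split(":")]
    return np.asarray(pairs, dtype=np.float32)
```

So each example ends up as a `(N, 2)` float32 array, where `N` differs from row to row.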
I want to use Hugging Face Datasets because it has very good compression and an easy-to-use interface.