Create a dataset consisting of numpy arrays: Sequence or ArrayND?

I want to convert my custom dataset into a Hugging Face dataset by creating it with from_generator first and then save_to_disk, so that I can load_from_disk later.

The generator function yields samples of the following form:

import numpy as np

# Each sample can have a different size along dimension 0,
# as mentioned in https://huggingface.co/docs/datasets/v3.0.2/en/about_dataset_features#dataset-features,
# where it is suggested to use None, e.g.: features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
def generatorFn():
    ...
    sample = {
               "example" : np.ones([4,5]), # 2D numpy array
               "label" : np.ones([10,5]), # 2D numpy array but dim 0 might not be common
               "numbers" : np.ones([7]), # 1D numpy array
               "has_noise" : True, # bool
              }
    yield sample

What would the recommended types be in the features? Is it Sequence, or Array2D + Array1D?

How can I make sure I get the correct shape and dtype when loading the dataset back?
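For context, here is a rough sketch of the pipeline I have in mind, assuming Array2D features with None for the variable first dimension (as in the docs example above); the feature names mirror the generator output and the dtypes are only what I intend to use:

import numpy as np
from datasets import Dataset, Features, Array2D, Sequence, Value

# Assumed feature spec: None lets dimension 0 vary between samples.
features = Features({
    "example": Array2D(shape=(None, 5), dtype="float32"),
    "label": Array2D(shape=(None, 5), dtype="float32"),
    "numbers": Sequence(Value("float32")),
    "has_noise": Value("bool"),
})

dataset = Dataset.from_generator(generatorFn, features=features)
dataset.save_to_disk("my_dataset_dir")  # hypothetical path; reloaded later with load_from_disk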

Additional info


I have two large files: one stores examples as rows and the other stores labels along with some metadata as rows. In the example file, each row consists of 2D coordinates of dtype float32. I saved them in the format 12.23,2332.343:54.45,767.767:... where a comma separates the two values within a coordinate and a colon separates consecutive coordinates.
I have processor functions to convert each row into a 2D numpy array, and other processor functions to read rows from the label file, which contains 2D coords along with some metadata.
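For illustration, here is a rough sketch of the kind of row parser I use for the example file, assuming the comma/colon format above (parse_coords_row is just a made-up name):

import numpy as np

def parse_coords_row(row: str) -> np.ndarray:
    # "12.23,2332.343:54.45,767.767:..." -> float32 array of shape (N, 2)
    pairs = [p.split(",") for p in row.strip().strip(":").split(":") if p]
    return np.asarray(pairs, dtype=np.float32)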

I want to use Hugging Face Datasets because it has very good compression and an easy-to-use interface.


I went with the following approach. It would be great if anyone could check it.

import numpy as np
from datasets import load_from_disk

class NumpyTransform:
    """Formatting transform that converts selected columns back to numpy arrays
    with the dtype declared in the dataset's features."""

    def __init__(self, features, arr_types=None, seq_types=None):
        self._feats = features
        # Columns stored as ArrayND features (dtype lives on the feature itself).
        self._arr_types = arr_types or list(self._feats.keys())
        # Columns stored as Sequence features (dtype lives on the inner feature).
        self._seq_types = seq_types or []

    def __call__(self, batch):
        sample = {}
        for key, val in batch.items():
            if key in self._arr_types:
                val = np.asarray(val, dtype=self._feats[key].dtype)
            elif key in self._seq_types:
                val = np.asarray(val, dtype=self._feats[key].feature.dtype)
            sample[key] = val
        return sample

dataset = load_from_disk(data_dir, keep_in_memory=None)
dataset = dataset.with_transform(
    NumpyTransform(dataset.features,
                   arr_types=["example", "label", "coords_label"],
                   seq_types=["coords_num"])
)
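To make sure I get the expected shape and dtype after loading back, I do a quick sanity check along these lines (the expected values in the comments are just from my generator example above):

sample = dataset[0]
print(type(sample["example"]), sample["example"].dtype, sample["example"].shape)  # e.g. numpy.ndarray float32 (4, 5)
print(type(sample["numbers"]), sample["numbers"].dtype, sample["numbers"].shape)  # e.g. numpy.ndarray float32 (7,)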