Setting dataset feature value as numpy array

Python: 3.9.7
Datasets: 2.1.0

I have a dataset where each example has a label and an array-like sequence of floats associated with it. The dataset is very large, so I have opted to create a loading script following these instructions. Specifically, my data looks something like this:

label     | data
--------------------------
'label_1' | [ -3.05175781e-05, 3.35693359e-04, -2.62451172e-03, 2.44140625e-03, ...]
'label_7' | [...]
.
.
.

where for each example the data column is a numpy array. When building the features for the _info method of the data loading script, I am not sure what to set as the value type for the data feature. Here is my current code:

def _info(self):
    features = datasets.Features(
        {
            'label': datasets.Value('string'),
            'data': datasets.Value(???)
        }
    )

    return datasets.DatasetInfo(
        features=features
    )

What data type is recommended for the data feature? Is there another type in datasets that is better suited to numpy arrays (i.e. not datasets.Value() but something like datasets.Sequence())?

Thank you in advance for your help! I really love the Hugging Face datasets library!

Hi! You can use datasets.Sequence(datasets.Value("float32")). Since a dataset is simply a wrapper around an Arrow table, your numpy array will be converted to Arrow format anyway.

Though you can still set the format of the dataset to "np" to output numpy arrays:

ds = ds.with_format("np")

@lhoestq Thank you for your response, it worked perfectly!!