Setting dataset feature value as numpy array

Python: 3.9.7
Datasets: 2.1.0

I have a dataset where each example has a label and an array-like sequence of floats associated with it. The dataset is very large, so I have opted to create a loading script following these instructions. Specifically, my data looks something like this:

label     | data
--------------------------
'label_1' | [ -3.05175781e-05, 3.35693359e-04, -2.62451172e-03, 2.44140625e-03, ...]
'label_7' | [...]
.
.
.

where for each example the data column is a numpy array. When building the features for the _info method of the data loading script, I am not sure what to set as the value type for the data feature. Here is my current code:

def _info(self):
    features = datasets.Features(
        {
            'label': datasets.Value('string'),
            'data': datasets.Value(???)
        }
    )

    return datasets.DatasetInfo(
        features=features
    )

What is recommended to use as the data type for the data feature I have? Is there another data type in datasets that is better suited for numpy arrays (i.e. not datasets.Value() but something like datasets.Sequence())?

Thank you in advance for your help! I really love the hugging face datasets library!

Hi! You can use datasets.Sequence(datasets.Value("float32")). Since a dataset is simply a wrapper around an Arrow table, your numpy array will be converted to Arrow format anyway.

Though you can still set the format of the dataset to "np" to output numpy arrays:

ds = ds.with_format("np")

@lhoestq Thank you for your response, it worked perfectly!!

@lhoestq One quick follow-up. Suppose my data is now a numpy array of size 100x2. How should I define this feature's data type in the loading script? Currently, I am getting this error:

Using custom data configuration default
Downloading and preparing dataset proto_data/default to /home/aclifton/.cache/huggingface/datasets/proto_data/default/0.0.0/e33b001c2bee045d8ad072bd018561ee193303716d8cdd062cefc3a83a8d655b...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 5035.18it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1416.04it/s]
Traceback (most recent call last):                     
  File "/home/aclifton/rf_fp/tmp.py", line 5, in <module>
    ds = load_dataset('/RAID/users/aclifton/rffp_datasets/proto_data_top_25_labels_data')
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/load.py", line 1691, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 694, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1095, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1356, in encode_example
    return encode_nested_example(self, example)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in encode_nested_example
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in <dictcomp>
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in encode_nested_example
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in <listcomp>
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1052, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 456, in encode_example
    return float(value)
TypeError: only size-1 arrays can be converted to Python scalars

This was a bug in old versions of datasets I think.

You can try updating datasets with pip install -U datasets

or you can convert your numpy array to a list: my_array.tolist()
