Setting dataset feature value as numpy array

Python: 3.9.7
Datasets: 2.1.0

I have a dataset where each example has a label and an array-like sequence of floats associated with it. The dataset is very large, so I have opted to create a loading script following these instructions. Specifically, my data looks something like this:

label     | data
--------------------------
'label_1' | [ -3.05175781e-05, 3.35693359e-04, -2.62451172e-03, 2.44140625e-03, ...]
'label_7' | [...]
.
.
.

where for each example the data column is a numpy array. When building the features for the _info method of the data loading script, I am not sure what to set as the value type for the data feature. Here is my current code:

def _info(self):
    features = datasets.Features(
        {
            'label': datasets.Value('string'),
            'data': datasets.Value(???)
        }
    )

    return datasets.DatasetInfo(
        features=features
    )

What data type is recommended for the data feature I have? Is there another data type in datasets that is better suited for numpy arrays (i.e., not datasets.Value() but something like datasets.Sequence())?

Thank you in advance for your help! I really love the hugging face datasets library!


Hi ! You can use datasets.Sequence(datasets.Value("float32")). Since a dataset is simply a wrapper around an Arrow table, your numpy array will be converted to Arrow format anyway.

Though you can still set the format of the dataset to "np" to output numpy arrays 🙂

ds = ds.with_format("np")
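
For reference, here is a minimal sketch that shows both points together, using Dataset.from_dict with made-up data instead of a loading script (column names taken from the example above):

import numpy as np
import datasets

features = datasets.Features(
    {
        'label': datasets.Value('string'),
        'data': datasets.Sequence(datasets.Value('float32')),
    }
)

# the numpy arrays are converted to Arrow-backed lists when the dataset is built
ds = datasets.Dataset.from_dict(
    {'label': ['label_1'], 'data': [np.array([-3.05175781e-05, 3.35693359e-04])]},
    features=features,
)

ds = ds.with_format("np")   # ask for numpy output
print(type(ds[0]['data']))  # expected: <class 'numpy.ndarray'>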

@lhoestq Thank you for your response, it worked perfectly!!

@lhoestq One quick follow-up. Suppose my data is now a numpy array of size 100x2; how should I define this feature's data type in the loading script? Currently, I am getting this error:

Using custom data configuration default
Downloading and preparing dataset proto_data/default to /home/aclifton/.cache/huggingface/datasets/proto_data/default/0.0.0/e33b001c2bee045d8ad072bd018561ee193303716d8cdd062cefc3a83a8d655b...
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5035.18it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1416.04it/s]
Traceback (most recent call last):                     
  File "/home/aclifton/rf_fp/tmp.py", line 5, in <module>
    ds = load_dataset('/RAID/users/aclifton/rffp_datasets/proto_data_top_25_labels_data')
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/load.py", line 1691, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 694, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1095, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1356, in encode_example
    return encode_nested_example(self, example)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in encode_nested_example
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in <dictcomp>
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in encode_nested_example
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in <listcomp>
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1052, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 456, in encode_example
    return float(value)
TypeError: only size-1 arrays can be converted to Python scalars

I think this was a bug in old versions of datasets.

You can try updating datasets: pip install -U datasets

Or you can convert your numpy array to a list: my_array.tolist()
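
If updating is not an option, another sketch for a fixed 100x2 column is the Array2D feature type (the shape and dtype below are assumptions based on your example):

import datasets

features = datasets.Features(
    {
        'label': datasets.Value('string'),
        # fixed-shape 2D feature; shape=(100, 2) and dtype='float32' are assumed here
        'data': datasets.Array2D(shape=(100, 2), dtype='float32'),
        # or, if the first dimension can vary, a nested sequence instead:
        # 'data': datasets.Sequence(datasets.Sequence(datasets.Value('float32'))),
    }
)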


I'm having the same issue, and updating or converting to a list doesn't work.

I'm trying to return "inputs_embeds", which is a 2D numpy array (num_tokens, emb_size), but I run into this error:

only size-1 arrays can be converted to Python scalars

But how do you set just one feature to be ndarrays?
For instance, when load_dataset()'ing a Mozilla Common Voice set (v11):

> tt = load_dataset(mozilla_model_name, and other stuff)
> type(tt.select([0])['audio'][0]['path'])
<class 'str'>
> type(tt.select([0])['audio'][0]['array'])
<class 'numpy.ndarray'>

But in my own code, with_format('np') sets all of the top-level features to numpy types:

> type(test_ds['path'][0])
<class 'numpy.str_'>

Test code to create a dataset and reload it to examine types:

#!/usr/bin/env python
# Trying to save and reload a numpy array to/from a huggingface dataset
# The type of the loaded array must be a numpy array
from datasets import Dataset, Features, Array2D, Sequence, Value
import numpy as np

audio_arrays = [np.random.rand(16000), np.random.rand(16000)] 

features = Features({
  # Each audio contains a np array of audio data, and a path to the src audio file
  'audio': Sequence({
    #'array': Sequence(feature=Array2D(shape=(None,), dtype="float32")),
    'array': Sequence(feature=Value('float32')),
    'path': Value('string'),
  }),
  'path': Value('string'), # Path is redundant in common voice set also
})

ddata = {
    'path': [],        # This will be a list of strings
    'audio': [],       # This will be a list of dictionaries
}

ddata['path'] = ['/foo0/', '/bar0/']  # ensures we see the storage difference
ddata['audio'] = [
        {'array': audio_arrays[0], 'path': '/foo1/' },
        {'array': audio_arrays[1], 'path': '/bar1/', },
]
ds = Dataset.from_dict(ddata)
ds = ds.with_format('np')
ds.save_to_disk('/tmp/ds.ds') 

loaded_dataset = Dataset.load_from_disk('/tmp/ds.ds')
ld = loaded_dataset
au = ld['audio'][0]
ar = ld['audio'][0]['array']
print("Type of audio array:", type(ar))
print("Type of path:", type(ld['path'][0]))
print("Type of au path:", type(ld['audio'][0]['path']))
import ipdb; ipdb.set_trace(context=16); pass

You can do

ds = ds.with_format("np", columns=["audio"], output_all_columns=True)

The audio column is then formatted as numpy arrays, and the other columns are left unformatted.
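
Applied to the test script above (replacing the plain with_format('np') call), a rough sketch of the expected types:

ds = ds.with_format("np", columns=["audio"], output_all_columns=True)
print(type(ds["audio"][0]["array"]))  # expected: numpy.ndarray (formatted column)
print(type(ds["path"][0]))            # expected: plain str (column left unformatted)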
