Setting dataset feature value as numpy array

Python: 3.9.7
Datasets: 2.1.0

I have a dataset where each example has a label and an array-like sequence of floats associated with it. The dataset is very large, so I have opted to create a loading script following these instructions. Specifically, my data looks something like this:

label     | data
--------------------------
'label_1' | [ -3.05175781e-05, 3.35693359e-04, -2.62451172e-03, 2.44140625e-03, ...]
'label_7' | [...]
.
.
.

where for each example the data column is a numpy array. When building the features for the _info method of the data loading script, I am not sure what to set as the value type for the data feature. Here is my current code:

def _info(self):
    features = datasets.Features(
        {
            'label': datasets.Value('string'),
            'data': datasets.Value(???)
        }
    )

    return datasets.DatasetInfo(
        features=features
    )

What data type is recommended for the data feature I have? Is there another data type in datasets that is better suited for numpy arrays (i.e., not datasets.Value() but something like datasets.Sequence())?

Thank you in advance for your help! I really love the hugging face datasets library!


Hi ! You can use datasets.Sequence(datasets.Value("float32")). Since a dataset is simply a wrapper around an Arrow table, your numpy array will be converted to Arrow format anyway.

Though you can still set the format of the dataset to "np" to output numpy arrays 🙂

ds = ds.with_format("np")
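
For reference, here is a minimal sketch that shows both points together, using Dataset.from_dict with made-up data instead of a loading script (column names taken from the example above):

import numpy as np
import datasets

features = datasets.Features(
    {
        'label': datasets.Value('string'),
        'data': datasets.Sequence(datasets.Value('float32')),
    }
)

# the numpy arrays are converted to Arrow-backed lists when the dataset is built
ds = datasets.Dataset.from_dict(
    {'label': ['label_1'], 'data': [np.array([-3.05175781e-05, 3.35693359e-04])]},
    features=features,
)

ds = ds.with_format("np")   # ask for numpy output
print(type(ds[0]['data']))  # expected: <class 'numpy.ndarray'>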

@lhoestq Thank you for your response, it worked perfectly!!

@lhoestq One quick follow-up. Suppose my data is now a numpy array of size 100x2; how should I define this feature's data type in the loading script? Currently, I am getting this error:

Using custom data configuration default
Downloading and preparing dataset proto_data/default to /home/aclifton/.cache/huggingface/datasets/proto_data/default/0.0.0/e33b001c2bee045d8ad072bd018561ee193303716d8cdd062cefc3a83a8d655b...
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5035.18it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1416.04it/s]
Traceback (most recent call last):                     
  File "/home/aclifton/rf_fp/tmp.py", line 5, in <module>
    ds = load_dataset('/RAID/users/aclifton/rffp_datasets/proto_data_top_25_labels_data')
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/load.py", line 1691, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 694, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/builder.py", line 1095, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1356, in encode_example
    return encode_nested_example(self, example)
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in encode_nested_example
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1007, in <dictcomp>
    return {k: encode_nested_example(sub_schema, sub_obj) for k, (sub_schema, sub_obj) in zip_dict(schema, obj)}
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in encode_nested_example
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1047, in <listcomp>
    return [encode_nested_example(schema.feature, o) for o in obj]
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 1052, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
  File "/home/aclifton/anaconda3/envs/rffp/lib/python3.9/site-packages/datasets/features/features.py", line 456, in encode_example
    return float(value)
TypeError: only size-1 arrays can be converted to Python scalars

I think this was a bug in old versions of datasets.

You can try updating datasets: pip install -U datasets

Or you can convert your numpy array to a list: my_array.tolist()
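
If updating is not an option, another sketch for a fixed 100x2 column is the Array2D feature type (the shape and dtype below are assumptions based on your example):

import datasets

features = datasets.Features(
    {
        'label': datasets.Value('string'),
        # fixed-shape 2D feature; shape=(100, 2) and dtype='float32' are assumed here
        'data': datasets.Array2D(shape=(100, 2), dtype='float32'),
        # or, if the first dimension can vary, a nested sequence instead:
        # 'data': datasets.Sequence(datasets.Sequence(datasets.Value('float32'))),
    }
)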


I'm having the same issue, and updating or converting to a list doesn't work.

I'm trying to return "inputs_embeds", which is a 2D numpy array (num_tokens, emb_size), but I run into this error:

only size-1 arrays can be converted to Python scalars

But how do you set just one feature to be ndarrays?
For instance, when load_dataset()'ing a Mozilla Common Voice set (v11):

> tt = load_dataset(mozilla_model_name, and other stuff)
> type(tt.select([0])['audio'][0]['path'])
<class 'str'>
> type(tt.select([0])['audio'][0]['array'])
<class 'numpy.ndarray'>

But in my own code, with_format('np') sets all of the top-level features to numpy types:

> type(test_ds['path'][0])
<class 'numpy.str_'>

Test code to create a dataset and reload it to examine types:

#!/usr/bin/env python
# Trying to save and reload a numpy array to/from a huggingface dataset
# The type of the loaded array must be a numpy array
from datasets import Dataset, Features, Array2D, Sequence, Value
import numpy as np

audio_arrays = [np.random.rand(16000), np.random.rand(16000)] 

features = Features({
  # Each audio contains a np array of audio data, and a path to the src audio file
  'audio': Sequence({
    #'array': Sequence(feature=Array2D(shape=(None,), dtype="float32")),
    'array': Sequence(feature=Value('float32')),
    'path': Value('string'),
  }),
  'path': Value('string'), # Path is redundant in common voice set also
})

ddata = {
    'path': [],        # This will be a list of strings
    'audio': [],       # This will be a list of dictionaries
}

ddata['path'] = ['/foo0/', '/bar0/']  # ensures we see the storage difference
ddata['audio'] = [
        {'array': audio_arrays[0], 'path': '/foo1/' },
        {'array': audio_arrays[1], 'path': '/bar1/', },
]
ds = Dataset.from_dict(ddata)
ds = ds.with_format('np')
ds.save_to_disk('/tmp/ds.ds') 

loaded_dataset = Dataset.load_from_disk('/tmp/ds.ds')
ld = loaded_dataset
au = ld['audio'][0]
ar = ld['audio'][0]['array']
print("Type of audio array:", type(ar))
print("Type of path:", type(ld['path'][0]))
print("Type of au path:", type(ld['audio'][0]['path']))
import ipdb; ipdb.set_trace(context=16); pass

You can do

ds = ds.with_format("np", columns=["audio"], output_all_columns=True)

The audio column is then formatted as numpy arrays, and the other columns are left unformatted.
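
Applied to the test script above (replacing the plain with_format('np') call), a rough sketch of the expected types:

ds = ds.with_format("np", columns=["audio"], output_all_columns=True)
print(type(ds["audio"][0]["array"]))  # expected: numpy.ndarray (formatted column)
print(type(ds["path"][0]))            # expected: plain str (column left unformatted)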
