Iterable datasets for array data, limited formatting options

Rmko4 · November 18, 2023, 2:57pm

Iterable datasets have limited formatting options, which is awkward when working with array data. Right now, when applying a function to the dataset with ‘map’, the default Python formatter is used, which converts each array to a Python list. This is very slow for large (multi-dimensional) arrays.

It would be convenient if other formatters were supported, such as NumPy. The Arrow formatter is (partially) supported, but certain iterators still appear to fall back to the Python format, which makes it very slow. Furthermore, using it requires more advanced knowledge of Apache Arrow.

The workaround I applied is to encode the arrays to a binary format. When loading the dataset as an iterable dataset, the binary data is first ‘decoded’ to a NumPy array before the transform is applied. A torch format is then applied to receive the processed data in the desired format.
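The encode/decode step of this workaround could look roughly like the sketch below. The post does not say which binary format was used, so the `.npy` serialization via `np.save`/`np.load` is an assumption, and `encode_array`/`decode_array` are hypothetical helper names.

```python
import io

import numpy as np


def encode_array(arr: np.ndarray) -> bytes:
    """Serialize an array to .npy bytes (keeps dtype and shape)."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def decode_array(blob: bytes) -> np.ndarray:
    """Rebuild the array from bytes produced by encode_array."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


# Round trip on a small multi-dimensional array.
original = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
blob = encode_array(original)
restored = decode_array(blob)
```

The bytes would be stored in a binary column, and `decode_array` would be called at the start of the mapped transform, so the array never passes through a Python list.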

My questions:
Is there a better way to work with array data when using an iterable dataset? I am not using image data, so the Image feature is of little use.
Are there plans to support other formatting options in the future?

lhoestq · November 29, 2023, 5:08pm

You can use the NumPy formatting:

ds = ds.with_format("numpy")

Then you can iterate over it, or use map, and get NumPy arrays:

for example_with_numpy_arrays in ds:
    ...

And for map:

ds = ds.map(my_func_that_uses_numpy_arrays)

Rmko4 · December 28, 2023, 7:55pm

@lhoestq thank you for your response! I have a working solution, but I couldn’t quite get it to work with the method you suggested. Perhaps I will have some time to look into it later.
