Iterable datasets for array data, limited formatting options

Iterable datasets have limited formatting options, which is awkward when working with array data. Right now, when a function is applied to the dataset with ‘map’, the default Python formatter is used, which converts each array to a Python list. This is very slow for large (multi-dimensional) arrays.

It would be convenient if other formatters, such as NumPy, were supported. The Arrow formatter is (partially) supported, but in certain cases it still appears to fall back to the Python format for some iterators, which makes it very slow. Furthermore, using it requires more advanced knowledge of Apache Arrow.

The workaround I applied is to encode the arrays in a binary format. When the dataset is loaded as an iterable dataset, the binary data is first decoded to a NumPy array before the transform is applied. A torch format is then applied to receive the processed data in the desired format.
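To make the workaround concrete, here is a minimal sketch of the encode/decode step. It assumes the arrays are serialized with np.save and np.load; the helper names (encode_array, decode_array, decode_example) and the "array" column name are hypothetical and only for illustration, not the exact code I use.

```python
import io

import numpy as np


def encode_array(arr: np.ndarray) -> bytes:
    """Serialize an array (including shape and dtype) to bytes in .npy format."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def decode_array(blob: bytes) -> np.ndarray:
    """Restore the array from the bytes produced by encode_array."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


def decode_example(example: dict) -> dict:
    """Decode the (hypothetical) binary "array" column of a single example."""
    example["array"] = decode_array(example["array"])
    return example


# Round trip on a small multi-dimensional array.
original = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
stored = {"array": encode_array(original)}   # what gets written to the dataset
restored = decode_example(dict(stored))      # what the transform receives
```

A function like decode_example would be the first step applied to each example of the iterable dataset, before the actual transform.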

My questions:
Is there a better way to work with array data when using an iterable dataset? I am not using image data, so the Image feature would not be of much use.
Are there plans to support other formatting options in the future?

You can use the NumPy formatting:

ds = ds.with_format("numpy")

Then you can iterate over it or use map, and you get NumPy arrays:

for example_with_numpy_arrays in ds:
    ...

With map:

ds = ds.map(my_func_that_uses_numpy_arrays)

@lhoestq thank you for your response! I have a working solution, but I couldn’t quite get it to work with the method you suggested. Perhaps I will have some time to look into it later.