Iterable datasets have limited formatting options, which is awkward when working with array data. Right now, when a function is applied to the dataset with ‘map’, the default Python formatter is used, which converts each array to a Python list. This is very slow for large (multi-dimensional) arrays.
It would be convenient if other formatters were supported, such as NumPy. The Arrow formatter is (partially) supported, but in some cases it still appears to fall back to the Python format for certain iterators, which makes it very slow. Furthermore, using it requires more advanced knowledge of Apache Arrow.
The workaround I applied is to encode the arrays to a binary format. When the dataset is loaded as an iterable dataset, the binary data is first ‘decoded’ to a NumPy array before the transform is applied. A torch format is then applied to receive the processed data in the desired format.
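To make the workaround concrete, here is a minimal sketch of the encode/decode step. It is not my exact code: the column name "signal", the helper names, the doubling transform, and the choice of NumPy's .npy serialization as the binary format are all assumptions made for illustration.

```python
import io

import numpy as np


def encode_array(arr: np.ndarray) -> bytes:
    """Serialize an array to bytes in .npy format (keeps dtype and shape)."""
    buffer = io.BytesIO()
    np.save(buffer, arr, allow_pickle=False)
    return buffer.getvalue()


def decode_array(blob: bytes) -> np.ndarray:
    """Rebuild the NumPy array from the bytes produced by encode_array."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


def encode_example(example: dict) -> dict:
    # "signal" is a hypothetical column name holding the array data.
    return {**example, "signal": encode_array(np.asarray(example["signal"]))}


def decode_and_transform(example: dict) -> dict:
    # Decode first, so the transform works on a NumPy array, not a Python list.
    arr = decode_array(example["signal"])
    # Placeholder transform (assumption): replace with the real processing.
    return {**example, "signal": arr * 2.0}
```

The idea is that encode_example is applied once when the dataset is written, and decode_and_transform is the function passed to ‘map’ on the iterable dataset, with the torch format set afterwards. Because the column then holds bytes rather than nested lists, the Python formatter has nothing to expand element by element.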
My questions:
Is there a better way to work with array data when using an iterable dataset? I am not using image data, so the Image feature is not of much use to me.
Are there plans to support other formatting options in the future?