Streaming .arrow IterableDataset with irregular first dimension

I have a bunch of .arrow files with the following feature:

        "readings": Array2D(
            dtype="float32", shape=(-1, length_seconds)
        )

Each file can be loaded individually without issue. However, streaming fails with this error:

...site-packages/datasets/features/features.py", line 760, in to_numpy
[rank11]:     numpy_arr = numpy_arr.reshape(len(self) - len(null_indices), *self.type.shape)
[rank11]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: ValueError: cannot reshape array of size 2352000 into shape (10,newaxis,12000)
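
For context, a minimal sketch of the two code paths (the file paths here are placeholders):

    from datasets import Dataset, load_dataset

    # Loading a single .arrow file eagerly works fine:
    ds = Dataset.from_file("data/shard-00000.arrow")
    print(ds[0]["readings"])  # variable-length 2D array, no error

    # Streaming the same files raises the ValueError above:
    streamed = load_dataset("arrow", data_files="data/*.arrow", streaming=True)
    next(iter(streamed["train"]))  # fails during the batched table read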

Digging around, it looks like ArrowExamplesIterable in datasets/iterable_dataset.py:L259 pre-loads batches of examples and assumes the table can be read directly in fixed-size chunks:

                for pa_subtable in pa_table.to_reader(max_chunksize=config.ARROW_READER_BATCH_SIZE_IN_DATASET_ITER):

This is normally fine, but it clearly won't work for data with an irregular first dimension. My question: other than manually padding the data to a uniform size on disk, are there other ways around this? I'd prefer to do the padding in the collate_fn, since that saves disk space and there's essentially no speed difference.
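
For reference, the collate_fn padding I have in mind is roughly this (a sketch; it assumes PyTorch, and zero-padding is my arbitrary choice):

    import numpy as np
    import torch

    def collate_fn(batch):
        # Pad the irregular first dimension up to the longest sample
        # in the batch.
        arrays = [np.asarray(ex["readings"], dtype=np.float32) for ex in batch]
        max_rows = max(a.shape[0] for a in arrays)
        padded = np.zeros(
            (len(arrays), max_rows, arrays[0].shape[1]), dtype=np.float32
        )
        for i, a in enumerate(arrays):
            padded[i, : a.shape[0]] = a
        return torch.from_numpy(padded)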

I think it should be shape=(None, length_seconds), as per the documentation:

The array type also allows the first dimension of the array to be dynamic. This is useful for handling sequences with variable lengths such as sentences, without having to pad or truncate the input to a uniform shape.


>>> features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
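
A quick check of that fix, using length_seconds=12000 from the traceback above (a sketch with made-up toy data):

    import numpy as np
    from datasets import Array2D, Dataset, Features

    length_seconds = 12000
    features = Features(
        {"readings": Array2D(dtype="float32", shape=(None, length_seconds))}
    )
    ds = Dataset.from_dict(
        {
            "readings": [
                np.zeros((3, length_seconds), dtype=np.float32).tolist(),
                np.zeros((7, length_seconds), dtype=np.float32).tolist(),
            ]
        },
        features=features,
    )
    # With the None first dimension, streaming iteration should yield
    # the variable-length rows without the reshape error.
    for example in ds.to_iterable_dataset():
        print(len(example["readings"]))  # 3, then 7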
