Streaming .arrow IterableDataset with irregular first dimension

I have a bunch of .arrow files with the following feature:

        "readings": Array2D(
            dtype="float32", shape=(-1, length_seconds)
        )

Each file can be loaded individually without issue. However, streaming fails with this error:

...site-packages/datasets/features/features.py", line 760, in to_numpy
[rank11]:     numpy_arr = numpy_arr.reshape(len(self) - len(null_indices), *self.type.shape)
[rank11]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: ValueError: cannot reshape array of size 2352000 into shape (10,newaxis,12000)
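
For context, a minimal sketch of the two code paths (the file paths here are placeholders):

    from datasets import Dataset, load_dataset

    # Loading a single .arrow file eagerly works fine:
    ds = Dataset.from_file("data/shard-00000.arrow")
    print(ds[0]["readings"])  # variable-length 2D array, no error

    # Streaming the same files raises the ValueError above:
    streamed = load_dataset("arrow", data_files="data/*.arrow", streaming=True)
    next(iter(streamed["train"]))  # fails during the batched table read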

Digging around, it looks like ArrowExamplesIterable in datasets/iterable_dataset.py:L259 pre-loads batches of examples and assumes the table can be read directly in fixed-size chunks:

                for pa_subtable in pa_table.to_reader(max_chunksize=config.ARROW_READER_BATCH_SIZE_IN_DATASET_ITER):

This is normally fine, but it clearly won't work for data with an irregular first dimension. My question: other than manually padding the data to a uniform size on disk, are there other ways around this? I'd prefer to do the padding in the collate_fn, since that saves disk space and there's essentially no speed difference.
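
For reference, the collate_fn padding I have in mind is roughly this (a sketch; it assumes PyTorch, and zero-padding is my arbitrary choice):

    import numpy as np
    import torch

    def collate_fn(batch):
        # Pad the irregular first dimension up to the longest sample
        # in the batch.
        arrays = [np.asarray(ex["readings"], dtype=np.float32) for ex in batch]
        max_rows = max(a.shape[0] for a in arrays)
        padded = np.zeros(
            (len(arrays), max_rows, arrays[0].shape[1]), dtype=np.float32
        )
        for i, a in enumerate(arrays):
            padded[i, : a.shape[0]] = a
        return torch.from_numpy(padded)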

I think it should be shape=(None, length_seconds), as per the documentation:

The array type also allows the first dimension of the array to be dynamic. This is useful for handling sequences with variable lengths such as sentences, without having to pad or truncate the input to a uniform shape.


>>> features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
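
A quick check of that fix, using length_seconds=12000 from the traceback above (a sketch with made-up toy data):

    import numpy as np
    from datasets import Array2D, Dataset, Features

    length_seconds = 12000
    features = Features(
        {"readings": Array2D(dtype="float32", shape=(None, length_seconds))}
    )
    ds = Dataset.from_dict(
        {
            "readings": [
                np.zeros((3, length_seconds), dtype=np.float32).tolist(),
                np.zeros((7, length_seconds), dtype=np.float32).tolist(),
            ]
        },
        features=features,
    )
    # With the None first dimension, streaming iteration should yield
    # the variable-length rows without the reshape error.
    for example in ds.to_iterable_dataset():
        print(len(example["readings"]))  # 3, then 7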
