I have a bunch of Arrow files with the following feature:

```python
"readings": Array2D(
    dtype="float32", shape=(-1, length_seconds)
)
```
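A rough sketch of the setup (`length_seconds`, the row counts, and the `to_iterable_dataset` call are illustrative stand-ins for my actual pipeline):

```python
import numpy as np
from datasets import Array2D, Dataset, Features

length_seconds = 12000
features = Features(
    {"readings": Array2D(dtype="float32", shape=(-1, length_seconds))}
)

ds = Dataset.from_dict(
    {
        "readings": [
            np.zeros((10, length_seconds), dtype="float32"),  # first dims differ
            np.zeros((9, length_seconds), dtype="float32"),
        ]
    },
    features=features,
)

ds[0]                                 # fine: individual examples load ok
next(iter(ds.to_iterable_dataset()))  # the streaming path raises the error below
```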
These files load perfectly fine when read individually. However, streaming fails with the following error:
```
...site-packages/datasets/features/features.py", line 760, in to_numpy
[rank11]: numpy_arr = numpy_arr.reshape(len(self) - len(null_indices), *self.type.shape)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: ValueError: cannot reshape array of size 2352000 into shape (10,newaxis,12000)
```
Digging around, it looks like `ArrowExamplesIterable` (in `datasets/iterable_dataset.py:L259`) pre-loads batches of examples, assuming the table can be read directly in fixed-size chunks:

```python
for pa_subtable in pa_table.to_reader(max_chunksize=config.ARROW_READER_BATCH_SIZE_IN_DATASET_ITER):
```
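To make the failure concrete: each chunk arrives as one flat buffer, and `to_numpy` then applies the declared per-row shape across the whole chunk. A minimal sketch using the numbers from the traceback above:

```python
import numpy as np

# One chunk of 10 rows arrives as a single flat buffer of 2352000 float32
# values. The rows have different first dimensions, so the total is not a
# multiple of one fixed row size.
flat = np.zeros(2352000, dtype="float32")

# features.py effectively calls reshape(num_rows, *type.shape), i.e.:
flat.reshape(10, -1, 12000)
# ValueError: cannot reshape array of size 2352000 into shape (10,newaxis,12000)
# (2352000 is not divisible by 10 * 12000, so the -1 cannot be inferred; and
# even if it happened to divide evenly, rows of unequal length would be split
# at the wrong boundaries.)
```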
Batched reading like this is normally fine, but it clearly won't work for data with an irregular first dimension. My question: other than padding the data to a uniform size on disk, are there other ways around this? I'd prefer to keep the padding in the collate_fn (sketched below), since storing unpadded data saves disk space and there's essentially no speed difference.
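For reference, the collate_fn padding I have in mind looks roughly like this (a sketch assuming PyTorch; the zero-padding and the mask are illustrative choices):

```python
import numpy as np
import torch

def collate_fn(batch):
    # Each example's "readings" has shape (n_i, length_seconds) with varying n_i.
    arrays = [np.asarray(ex["readings"], dtype=np.float32) for ex in batch]
    max_rows = max(a.shape[0] for a in arrays)

    # Zero-pad along the first dimension up to the largest n_i in the batch.
    padded = np.stack(
        [np.pad(a, ((0, max_rows - a.shape[0]), (0, 0))) for a in arrays]
    )
    # Boolean mask marking the real (non-padded) rows.
    mask = np.stack([np.arange(max_rows) < a.shape[0] for a in arrays])

    return {
        "readings": torch.from_numpy(padded),
        "mask": torch.from_numpy(mask),
    }

# Usage: torch.utils.data.DataLoader(ds, batch_size=8, collate_fn=collate_fn)
```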