Proprietary database load error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)

I'm currently trying to load a proprietary dataset from disk. When the script finishes loading the train split I receive the error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray).

The files themselves are a large set of videos. I load in each video using pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info e.g. class-labels and filenames.

The script works fine when I extract a single frame from each video, but raises the TypeError as soon as I extract more than one. I can only presume it has something to do with memory size: 10 frames per video uses ~16 GB of my 32 GB of memory.

Is there some way to do intermediate Arrow-file writes to lower memory overhead?
Alternatively is there some way to pass user defined variables to a load_dataset function, such that I can load the dataset multiple times extracting different frames and concatenate the resultant datasets?
Alternatively, any ideas on how to fix the error? I'm using datasets 1.17.0
Thanks in advance

Hi! Could you please copy and paste the entire stack trace?

Is there some way to do intermediate Arrow-file writes to lower memory overhead?

You can control RAM usage in dataset scripts with the DEFAULT_WRITER_BATCH_SIZE attribute of GeneratorBasedBuilder (as we do in this script).

The files themselves are a large set of videos. I load in each video using pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info e.g. class-labels and filenames.

In datasets 1.18.0, we added support for nested decoding of the Image feature which is ideal for your use-case. To use it, just define the features dict as:

from datasets import Features, Image, Sequence

features = Features({
    "frames": Sequence(Image(), length=10),
    "meta": ...
})

and yield data as:

yield idx, {
    "frames": [pil_img_frame1, pil_img_frame2, ...],
    "meta": ...,
}
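Since `length=10` fixes the number of frames per row, each video has to contribute exactly 10 frames. One possible way to do that is to sample evenly spaced frame indices, sketched below in plain Python. The function names, the `load_frame` callable and the `filename` key inside `meta` are assumptions made for the example, not part of the datasets API.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def evenly_spaced_indices(num_frames: int, num_samples: int = 10) -> List[int]:
    """Pick `num_samples` distinct frame indices spread across a video."""
    if num_samples <= 0 or num_frames < num_samples:
        raise ValueError("video has fewer frames than requested samples")
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def generate_rows(
    videos: Iterable[Tuple[str, int]],
    load_frame: Callable[[str, int], Any],
    num_samples: int = 10,
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, row) pairs in the shape shown above.

    `videos` is an iterable of (filename, frame_count) pairs and
    `load_frame(filename, index)` is a hypothetical callable that returns
    one frame, for example a PIL image read through pims.
    """
    for idx, (filename, num_frames) in enumerate(videos):
        indices = evenly_spaced_indices(num_frames, num_samples)
        yield idx, {
            "frames": [load_frame(filename, i) for i in indices],
            "meta": {"filename": filename},  # assumed metadata layout
        }
```

Inside a dataset script, the body of `_generate_examples` could delegate to something like `generate_rows`, passing a loader that opens the video and returns the requested frame.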

Thanks for the fast response. I'll need a day or two to get the full stack trace, as I'll have to go back to an earlier version of the code. I found a hacky way around it by hardcoding a rather large number of BuilderConfigs, one for each frame, and then concatenating the individually "configged" datasets.

It's also possible that this was fixed in 1.18; I had no idea there was a new version.

As for the new features in 1.18, they look perfect. I shall investigate!