Proprietary database load error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)

Sethn · January 24, 2022, 1:34pm

I’m currently trying to load a proprietary dataset from disk. When the script finishes loading the train split I receive the error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray).

The files themselves are a large set of videos. I load in each video using pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info e.g. class-labels and filenames.

The script works fine when I extract a single frame from each video, but raises the TypeError as soon as I extract more than one. I can only presume it has something to do with memory size: 10 frames per video uses ~16 GB of my 32 GB of memory.

Is there some way to do intermediate Arrow-file writes to lower memory overhead?
Alternatively, is there some way to pass user-defined variables to a load_dataset function, such that I can load the dataset multiple times, extracting different frames each time, and concatenate the resulting datasets?
Alternatively, any ideas on how to fix the error? I’m using datasets 1.17.0.
Thanks in advance.

mariosasko · January 24, 2022, 7:29pm

Hi! Could you please copy and paste the entire stack trace?

> Is there some way to do intermediate Arrow-file writes to lower memory overhead?

You can control RAM usage in dataset scripts with the DEFAULT_WRITER_BATCH_SIZE attribute of GeneratorBasedBuilder (as we do in this script).

> The files themselves are a large set of videos. I load in each video using pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info e.g. class-labels and filenames.

In datasets 1.18.0, we added support for nested decoding of the Image feature which is ideal for your use-case. To use it, just define the features dict as:

from datasets import Features, Image, Sequence

features = Features({
    "frames": Sequence(Image(), length=10),  # 10 images per row, decoded on access
    "meta": ...
})

and yield data as:

yield idx, {
    "frames": [pil_img_frame1, pil_img_frame2, ...],
    "meta": ...,
}

Sethn · January 25, 2022, 2:02pm

Thanks for the fast response. I’ll need a day or two to get the full stack trace, as I’ll have to go back to an earlier version of the code. I found a hacky way around it by hardcoding a rather large number of BuilderConfigs, one for each frame, and then concatenating the individually “configged” datasets.

It’s also possible that this was fixed in 1.18; I had no idea there was a new version.

As for the new features in 1.18, they look perfect. I shall investigate!
